Massive MIMO Systems with Non-Ideal Hardware: Energy Efficiency, Estimation, and Capacity Limits

Abstract

The use of large-scale antenna arrays can bring substantial improvements in energy and/or spectral efficiency to wireless systems due to the greatly improved spatial resolution and array gain. Recent works in the field of massive multiple-input multiple-output (MIMO) show that the user channels decorrelate when the number of antennas at the base stations (BSs) increases, thus strong signal gains are achievable with little inter-user interference. Since these results rely on asymptotics, it is important to investigate whether the conventional system models are reasonable in this asymptotic regime. This paper considers a new system model that incorporates general transceiver hardware impairments at both the BSs (equipped with large antenna arrays) and the single-antenna user equipments (UEs). As opposed to the conventional case of ideal hardware, we show that hardware impairments create finite ceilings on the channel estimation accuracy and on the downlink/uplink capacity of each UE. Surprisingly, the capacity is mainly limited by the hardware at the UE, while the impact of impairments in the large-scale arrays vanishes asymptotically and inter-user interference (in particular, pilot contamination) becomes negligible. Furthermore, we prove that the huge degrees of freedom offered by massive MIMO can be used to reduce the transmit power and/or to tolerate larger hardware impairments, which allows for the use of inexpensive and energy-efficient antenna elements.

Emil Björnson, Member, IEEE, Jakob Hoydis, Member, IEEE, Marios Kountouris, Member, IEEE, and Mérouane Debbah, Senior Member, IEEE

Index Terms: Capacity bounds, channel estimation, energy efficiency, massive MIMO, pilot contamination, time-division duplex, transceiver hardware impairments.

I. INTRODUCTION

Copyright (c) 2014 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

Matlab code that reproduces all simulation results is available online, see https://github.com/emilbjornson/massive-MIMO-hardware-impairments/

E. Björnson was with the Alcatel-Lucent Chair on Flexible Radio, Supélec, Gif-sur-Yvette, France, and with the Department of Signal Processing, KTH Royal Institute of Technology, Stockholm, Sweden. He is currently with the Department of Electrical Engineering (ISY), Linköping University, Sweden (email: [email protected]). J. Hoydis was with Bell Laboratories, Alcatel-Lucent, Germany. He is now with Spraed SAS, Orsay, France (email: [email protected]). M. Kountouris and M. Debbah are with SUPELEC, Gif-sur-Yvette, France (e-mail: [email protected], [email protected]).

This paper was presented in part at the International Conference on Digital Signal Processing (DSP), Santorini, Greece, July 2013 [1].

The work of E. Björnson was funded by the International Postdoc Grant 2012-228 from The Swedish Research Council. This research has been supported by the ERC Starting Grant 305123 MORE (Advanced Mathematical Tools for Complex Network Engineering). Parts of this work have been performed in the framework of the FP7 project ICT-317669 METIS. This work was supported by the Future and Emerging Technologies (FET) project HIATUS within the Seventh Framework Programme for Research of the European Commission under FET-Open grant number 265578.

The spectral efficiency of a wireless link is limited by the information-theoretic capacity [2], which depends not only on the signal-to-noise ratio (SNR) but also on spatial correlation in the propagation environment [3], [4], channel estimation accuracy [5], transceiver hardware impairments [6], [7], and signal processing resources [8], [9].
It is of profound importance to increase the spectral efficiency of future networks, to keep up with the increasing demand for wireless services. However, this is a challenging task and usually comes at the price of having stricter hardware and overhead requirements.

A new network architecture has recently been proposed with the remarkable potential of both increasing the spectral efficiency and relaxing the aforementioned implementation issues. It is known as massive MIMO, or large-scale MIMO, and is based on having a very large number of antennas at each BS and exploiting channel reciprocity in time-division duplex (TDD) mode [9]–[13]. Some key features are: 1) propagation losses are mitigated by a large array gain due to coherent beamforming/combining; 2) interference leakage due to channel estimation errors vanishes asymptotically in the large-dimensional vector space; 3) low-complexity signal processing algorithms are asymptotically optimal; and 4) inter-user interference is easily mitigated by the high beamforming resolution.

The amount of research on massive MIMO is increasing rapidly, but the impact of transceiver hardware impairments on these systems has received little attention so far, although large arrays might only be attractive for network deployment if each antenna element consists of inexpensive hardware. Cheap hardware components are particularly prone to the impairments that exist in any transceiver (e.g., amplifier non-linearities, I/Q imbalance, phase noise, and quantization errors [14]–[23]). The influence of hardware impairments is usually mitigated by compensation algorithms [14], which can be implemented by analog and digital signal processing. These techniques cannot remove the impairments completely; residual impairments remain since the time-varying hardware characteristics cannot be fully parameterized and estimated, and because there is randomness induced by different types of noise.
Transceiver impairments are known to fundamentally limit the capacity in the high-power regime [6], [24], while there are only a few publications that analyze the behavior in the regime of a large number of antennas. Lower bounds on the achievable uplink sum rate in massive single-cell systems with phase noise from free-running oscillators were derived in [25]. The impact of amplifier non-linearities in a transmitter can be reduced by having a low peak-to-average power ratio (PAPR). The excess degrees of freedom offered by massive MIMO were used in [26] to optimize the downlink precoding for low PAPR, while [27] considered a constant-envelope precoding scheme designed for very low PAPR.

Fig. 1: Illustration of the reciprocal channel between a BS equipped with a large antenna array and a single-antenna UE. (Figure labels: Base station; User equipment; Downlink: Pilots & Data; Uplink: Pilots & Data.)

This paper analyzes the aggregate impact of different hardware impairments on systems with large antenna arrays, in contrast to the ideal hardware considered in [10]–[13] and the single type of impairments considered in [25]–[27]. We assume that appropriate compensation algorithms have been applied and focus on the residual hardware impairments. Motivated by the analytical and experimental results in [14]–[18], the residual hardware impairments at the transmitter and receiver are modeled as additive distortion noises with certain important properties. The system model with hardware impairments is defined and motivated in Section II. Section III derives a new pilot-based channel estimator and shows that the estimation accuracy is limited by the levels of impairments. The focus of Section IV is on a single link in the system, where we derive lower and upper bounds on the downlink and uplink capacities. Our results reveal the existence of finite capacity ceilings due to hardware impairments. Despite these discouraging results, Section V shows that a high energy efficiency and resilience towards hardware impairments at the BS can be achieved. Section VI puts these results in a multi-cell context and shows that inter-user interference (including pilot contamination) basically drowns in the distortion noise from hardware impairments. Section VII describes the impact of various refinements of the system model, while Section VIII summarizes the contributions and insights of the paper.

To encourage reproducibility and extensions to this paper, all the simulation results can be generated by the Matlab code that is available at https://github.com/emilbjornson/massive-MIMO-hardware-impairments/

Notation:

Boldface (lower case) is used for column vectors, $\mathbf{x}$, and (upper case) for matrices, $\mathbf{X}$. Let $\mathbf{X}^T$, $\mathbf{X}^*$, and $\mathbf{X}^H$ denote the transpose, conjugate, and conjugate transpose of $\mathbf{X}$, respectively. $\mathbf{X}_1 \succeq \mathbf{X}_2$ means that $\mathbf{X}_1 - \mathbf{X}_2$ is positive semi-definite. A diagonal matrix with $a_1, \ldots, a_N$ on the main diagonal is denoted $\mathrm{diag}(a_1, \ldots, a_N)$ and $\mathbf{I}$ denotes an identity matrix (of appropriate dimensions). The Frobenius and spectral norms of a matrix $\mathbf{X}$ are denoted by $\|\mathbf{X}\|_F$ and $\|\mathbf{X}\|_2$, respectively, while $\|\mathbf{x}\|_k$ denotes the $L_k$ norm of a vector $\mathbf{x}$. A stochastic variable $x$ and its realization are denoted in the same way, for brevity. The expectation operator with respect to a stochastic variable $x$ is denoted $\mathbb{E}\{x\}$, while $\mathbb{E}\{x \mid y\}$ is the conditional expectation when $y$ is given. A Gaussian stochastic variable $x$ is denoted $x \sim \mathcal{N}(\bar{x}, q)$, where $\bar{x}$ is the mean and $q$ is the variance. A circularly symmetric complex Gaussian stochastic vector $\mathbf{x}$ is denoted $\mathbf{x} \sim \mathcal{CN}(\bar{\mathbf{x}}, \mathbf{Q})$, where $\bar{\mathbf{x}}$ is the mean and $\mathbf{Q}$ is the covariance matrix. The empty set is denoted by $\emptyset$. The big $\mathcal{O}$ notation $f(x) = \mathcal{O}(g(x))$ means that $\left| \frac{f(x)}{g(x)} \right|$ is bounded as $x \to \infty$.

II. CHANNEL AND SYSTEM MODEL

For analytical clarity, the major part of this paper analyzes the fundamental spectral and energy efficiency limits of a single link, which operates under arbitrary interference conditions. The link is established between an $N$-antenna BS and a single-antenna UE. A main characteristic in the analysis is that the number of antennas $N$ can be very large. We consider a TDD protocol that toggles between uplink (UL) and downlink (DL) transmission on the same flat-fading subcarrier. This enables efficient channel estimation even when $N$ is large, because the estimation accuracy and overhead in the UL are independent of $N$ [9]. The acquired instantaneous channel state information (CSI) is utilized for UL data detection as well as DL data transmission, by exploiting channel reciprocity; see Fig. 1. In Section VI, we put our results in a multi-cell context with many users, inter-cell interference, and pilot contamination.

We assume a block fading structure where each channel is static for a coherence period of $T_{\mathrm{coher}}$ channel uses. The channel realizations are generated randomly and are independent between blocks. For simplicity, $T_{\mathrm{coher}}$ is the same for the useful channel and any interfering channels, and the coherence periods are synchronized. We consider the conventional TDD protocol in Fig. 2, which can be found in many previous works; see for example [28] and [29]. Each block begins with UL pilot/control signaling for $T^{\mathrm{UL}}_{\mathrm{pilot}}$ channel uses, followed by UL data transmission for $T^{\mathrm{UL}}_{\mathrm{data}}$ channel uses. Next, the system toggles to the DL. This part begins with $T^{\mathrm{DL}}_{\mathrm{pilot}}$ channel uses of DL pilot/control signaling. These pilots are typically used by the UEs to estimate their effective channel (with precoding) and the current interference conditions, which enables coherent DL reception. Note that these quantities are scalars irrespective of $N$, thus the DL pilot signaling need not scale with $N$. The coherence period ends with DL data transmission for $T^{\mathrm{DL}}_{\mathrm{data}}$ channel uses.
The four parameters satisfy $T^{\mathrm{UL}}_{\mathrm{pilot}} + T^{\mathrm{UL}}_{\mathrm{data}} + T^{\mathrm{DL}}_{\mathrm{pilot}} + T^{\mathrm{DL}}_{\mathrm{data}} = T_{\mathrm{coher}}$. The analysis of this paper is valid for arbitrary fixed values of those parameters, but we note that these can also be optimized dynamically based on $T_{\mathrm{coher}}$, user load, user conditions, ratio of UL/DL traffic, etc.

Fig. 2: Cyclic operation of a block-fading TDD system, where the coherence period $T_{\mathrm{coher}}$ is divided into phases for UL/DL pilot and data transmission. (Figure labels, in order: Uplink Pilot & Control Signals, $T^{\mathrm{UL}}_{\mathrm{pilot}}$; Uplink Data Transmission, $T^{\mathrm{UL}}_{\mathrm{data}}$; Downlink Pilot & Control Signals, $T^{\mathrm{DL}}_{\mathrm{pilot}}$; Downlink Data Transmission, $T^{\mathrm{DL}}_{\mathrm{data}}$.)

The stochastic block-fading channel between the BS and the UE is denoted as $\mathbf{h} \in \mathbb{C}^{N \times 1}$. It is modeled as an ergodic process with a fixed independent realization $\mathbf{h} \sim \mathcal{CN}(\mathbf{0}, \mathbf{R})$ in each coherence period. This is known as Rayleigh block fading and $\mathbf{R} = \mathbb{E}\{\mathbf{h}\mathbf{h}^H\} \in \mathbb{C}^{N \times N}$ is the positive semi-definite covariance matrix. The statistical distribution is assumed to be known at the BS. In the asymptotic analysis, we make the following technical assumptions:

• The spectral norm of $\mathbf{R}$ is uniformly bounded, irrespective of the number of antennas $N$ (i.e., $\|\mathbf{R}\|_2 = \mathcal{O}(1)$);
• The trace of $\mathbf{R}$ scales linearly with $N$ (i.e., $0 < \liminf_N \frac{1}{N}\mathrm{tr}(\mathbf{R}) \leq \limsup_N \frac{1}{N}\mathrm{tr}(\mathbf{R}) < \infty$) and $\mathbf{R}$ has strictly positive diagonal elements.

The first assumption is a necessary physical property that originates from the law of energy conservation. It is also a common enabler for asymptotic analysis (cf. [12]). The second assumption is a typical consequence of increasing the array size with $N$ and thereby improving the spatial resolution and aperture [9]. These assumptions imply $0 < \liminf_N \frac{1}{N}\mathrm{rank}(\mathbf{R}) \leq 1$, which means that $\mathbf{R}$ can be rank deficient but the rank increases with $N$ such that $cN \leq \mathrm{rank}(\mathbf{R}) \leq N$ for some $c > 0$. We stress that $\mathbf{R}$ is generally not a scaled identity matrix, but describes the spatial propagation environment and array geometry. It might be rank-deficient (e.g., have a large condition number) for large arrays due to insufficient richness of the scattering [3], [4].

Footnote (to the use of channel reciprocity): The physical channels are always reciprocal, but different transceiver chains are typically used in the UL and DL. Careful calibration is therefore necessary to utilize the reciprocity for transmission; see Section VII-E.

A. Transceiver Hardware Impairments

The majority of papers on massive MIMO systems consider channels with ideal transceiver hardware. However, practical transceivers suffer from hardware impairments that 1) create a mismatch between the intended transmit signal and what is actually generated and emitted; and 2) distort the received signal in the reception processing. In this paper, we analyze how these impairments impact the performance and key asymptotic properties of massive MIMO systems.

Physical transceiver implementations consist of many different hardware components (e.g., amplifiers, converters, mixers, filters, and oscillators [30]) and each one distorts the signals in its own way. The hardware imperfections are unavoidable, but the severity of the impairments depends on engineering decisions: larger distortions can be deliberately introduced to decrease the hardware cost and/or the power consumption [7]. The non-ideal behavior of each component can be modeled in detail for the purpose of designing compensation algorithms, but even after compensation there remain residual transceiver impairments [15], [17]; for example, due to insufficient modeling accuracy, imperfect estimation of model parameters, and time-varying characteristics induced by noise.

Footnote (to the technical assumptions on $\mathbf{R}$ in the asymptotic analysis): Although these assumptions make sense for practically large $N$ [4], we cannot physically let $N \to \infty$ since the propagation environment is enclosed by a finite volume [9]. Nevertheless, our simulations reveal that the asymptotic analysis enabled by the technical assumptions is accurate at quite small $N$.

From a system performance perspective, it is the aggregate effect of all the residual transceiver impairments that is important, not the individual behavior of each hardware component. Recently, a new system model has been proposed in [14]–[19] where the aggregate residual hardware impairments are modeled by independent additive distortion noises at the BS as well as at the UE.
We adopt this model herein due to its analytical tractability and the experimental verifications in [15]–[17]. The details of the DL and UL system models are given in the next subsections, and these are then used in Sections III–VI to analyze different aspects of massive MIMO systems. Possible model refinements are then provided in Section VII, along with discussions on how these might impact the main results of this paper.

B. Downlink System Model

The downlink channel is used for data transmission and pilot-based channel estimation; see Fig. 1. The received DL signal $y \in \mathbb{C}$ in a flat-fading multiple-input single-output (MISO) channel is conventionally modeled as

$$y = \mathbf{h}^T \mathbf{s} + n \tag{1}$$

where $\mathbf{s} \in \mathbb{C}^{N \times 1}$ is either a deterministic pilot signal (during channel estimation) or a stochastic zero-mean data signal; in any case, the covariance matrix is denoted $\mathbf{W} = \mathbb{E}\{\mathbf{s}\mathbf{s}^H\}$ and the average power is $p^{\mathrm{BS}} = \mathrm{tr}(\mathbf{W})$. $\mathbf{W}$ is a design parameter that might be a function of the channel realization $\mathbf{h}$ and the realizations of any other channel in the system (e.g., due to precoding); we let $\mathcal{H}$ denote the set of channel realizations for all useful and interfering channels (i.e., $\mathbf{h} \in \mathcal{H}$). Hence, $\mathbf{W}$ is constant within each coherence period but changes between coherence periods since $\mathcal{H}$ changes. The additive term $n = n_{\mathrm{noise}} + n_{\mathrm{interf}}$ is an ergodic stochastic process that consists of independent receiver noise $n_{\mathrm{noise}} \sim \mathcal{CN}(0, \sigma^2)$ and interference $n_{\mathrm{interf}}$ from simultaneous transmissions (e.g., to other UEs). The interference has zero mean and is independent of the data signal, but might depend on any channel in the system (e.g., those that carry interference). Hence, the conditional interference variance is $\mathbb{E}\{|n_{\mathrm{interf}}|^2 \mid \mathcal{H}\} = I^{\mathrm{UE}}_{\mathcal{H}} \geq 0$ in the coherence period where the channel realizations are $\mathcal{H}$. The long-term interference variance is denoted $\mathbb{E}\{I^{\mathrm{UE}}_{\mathcal{H}}\}$. It is only for brevity that we use a common notation $n$ for interference and receiver noise; it does not mean that the interference must be treated as noise at the UE. A detailed interference model is provided in Section VI.

To model systems with non-ideal hardware more accurately, we consider the new system model from [14]–[19] where the received signal at the UE is

$$y = \mathbf{h}^T (\mathbf{s} + \boldsymbol{\eta}^{\mathrm{BS}}_t) + \eta^{\mathrm{UE}}_r + n. \tag{2}$$

The difference from the conventional model in (1) is the additive distortion noise terms $\boldsymbol{\eta}^{\mathrm{BS}}_t \in \mathbb{C}^{N \times 1}$ and $\eta^{\mathrm{UE}}_r \in \mathbb{C}$, which are ergodic stochastic processes that describe the residual transceiver impairments of the transmitter hardware at the BS and the receiver hardware at the UE, respectively. We assume that these are independent of the signal $\mathbf{s}$, but depend on the channel $\mathbf{h}$ and thus are stationary only within each coherence period. In particular, we consider the conditional distributions $\boldsymbol{\eta}^{\mathrm{BS}}_t \sim \mathcal{CN}(\mathbf{0}, \boldsymbol{\Upsilon}^{\mathrm{BS}}_t)$ and $\eta^{\mathrm{UE}}_r \sim \mathcal{CN}(0, \upsilon^{\mathrm{UE}}_r)$ for given channel realizations $\mathcal{H}$. The Gaussian distributions of $\boldsymbol{\eta}^{\mathrm{BS}}_t$ and $\eta^{\mathrm{UE}}_r$ have been verified experimentally (see e.g., [17, Fig. 4.13]) and can be motivated analytically by the central limit theorem, since the distortion noises describe the aggregate effect of many residual hardware impairments. A key property is that the distortion noise caused at an antenna is proportional to the signal power at this antenna (see [15]–[17] for experimental verifications), thus we have

$$\boldsymbol{\Upsilon}^{\mathrm{BS}}_t = \kappa^{\mathrm{BS}}_t \, \mathrm{diag}(W_{11}, \ldots, W_{NN}) \tag{3}$$

$$\upsilon^{\mathrm{UE}}_r = \kappa^{\mathrm{UE}}_r \, \mathbf{h}^T \mathbf{W} \mathbf{h}^* \tag{4}$$

where $W_{ii}$ is the $i$th diagonal element of $\mathbf{W}$ and $\kappa^{\mathrm{BS}}_t, \kappa^{\mathrm{UE}}_r \geq 0$ are the proportionality coefficients. The intuition is that a fixed portion of the signal is turned into distortion; for example, due to quantization errors in automatic-gain-controlled analog-to-digital conversion (ADC), inter-carrier interference induced by phase noise, leakage from the mirror subcarrier under I/Q imbalance, and amplitude-amplitude nonlinearities in the power amplifier [14], [21], [31]. The proportionality coefficients are treated as constants in the analysis, but can generally increase with the signal power; see Section VII-B for details.

Remark 1 (Distortion Noise and EVM). Distortion noise is an alteration of the useful signal, while the classical receiver noise models random fluctuations in the electronic circuits at the receiver.
A main difference is thus that the distortion noise power is non-stationary since it is proportional to the signal power $p^{\mathrm{BS}}$ and the current channel gain $\|\mathbf{h}\|_2^2$. The proportionality coefficients $\kappa^{\mathrm{BS}}_t$ and $\kappa^{\mathrm{UE}}_r$ characterize the levels of impairments and are related to the error vector magnitude (EVM) [15]; for example, the EVM at the BS is defined as

$$\mathrm{EVM}^{\mathrm{BS}}_t = \sqrt{\frac{\mathbb{E}\{\|\boldsymbol{\eta}^{\mathrm{BS}}_t\|_2^2 \mid \mathcal{H}\}}{\mathbb{E}\{\|\mathbf{s}\|_2^2 \mid \mathcal{H}\}}} = \sqrt{\frac{\mathrm{tr}(\boldsymbol{\Upsilon}^{\mathrm{BS}}_t)}{\mathrm{tr}(\mathbf{W})}} = \sqrt{\kappa^{\mathrm{BS}}_t}. \tag{5}$$

The EVM is a common quality measure of transceivers and the 3GPP LTE standard specifies total EVM requirements in the range [0 . , . ], where higher spectral efficiencies (modulations) are supported if the EVM is smaller [31, Sec. 14.3.4]. LTE transceivers typically support all the standardized modulations, thus the EVM is below 0.08. Larger EVMs are, however, of interest in massive MIMO systems since such relaxed hardware constraints enable the use of low-cost equipment. Therefore, the simulations in this paper consider $\kappa$-parameters in the range [0 , . ], where small values represent accurate and expensive transceiver hardware.

The system model in (2) captures the main characteristics of non-ideal hardware, in the sense that it allows us to identify some fundamental differences in the behavior of massive MIMO systems as compared to the case of ideal hardware. However, it cannot capture all practical characteristics of residual transceiver hardware impairments. Possible refinements, and their respective implications on our analytical results and observations, are outlined in Section VII.

Footnote (to the assumed independence and Gaussianity of the distortion noises): These are model assumptions that originate from the experimental works of [15]–[17]. An analytic motivation of the assumptions (which should not be misinterpreted as a proof) can be obtained from the Bussgang theorem; see Section VII.
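As a rough numerical illustration of (3)–(5), the following Python sketch evaluates the two distortion variances and the BS EVM for a single channel realization. The function names, the example channel, and the rank-one choice $\mathbf{W} = p^{\mathrm{BS}} \mathbf{w}\mathbf{w}^H$ with $\mathbf{w} = \mathbf{h}^*/\|\mathbf{h}\|_2$ are assumptions made only for this example; they are not taken from the paper or its Matlab code.

```python
import math

def bs_distortion_diag(W, kappa_bs_t):
    """Diagonal of Upsilon_t^BS in (3): kappa * diag(W_11, ..., W_NN)."""
    return [kappa_bs_t * W[i][i].real for i in range(len(W))]

def ue_distortion_var(h, W, kappa_ue_r):
    """upsilon_r^UE in (4): kappa * h^T W h^*."""
    n = len(h)
    quad = sum(h[i] * W[i][j] * h[j].conjugate()
               for i in range(n) for j in range(n))
    return kappa_ue_r * quad.real

def evm_bs(W, kappa_bs_t):
    """EVM at the BS in (5): sqrt(tr(Upsilon_t^BS) / tr(W))."""
    trace_w = sum(W[i][i].real for i in range(len(W)))
    return math.sqrt(sum(bs_distortion_diag(W, kappa_bs_t)) / trace_w)

# Hypothetical example values (not from the paper).
h = [1 + 1j, 0.5 - 2j, -1j]
p_bs = 2.0
kappa = 0.01
gain = sum(abs(x) ** 2 for x in h)                 # ||h||^2
w = [x.conjugate() / math.sqrt(gain) for x in h]   # unit-norm direction h^*/||h||
W = [[p_bs * w[i] * w[j].conjugate() for j in range(len(h))]
     for i in range(len(h))]

print(ue_distortion_var(h, W, kappa))  # equals kappa * p_bs * ||h||^2 up to rounding
print(evm_bs(W, kappa))                # equals sqrt(kappa) up to rounding
```

With this choice of $\mathbf{W}$, the UE distortion variance scales with the instantaneous channel gain, which is the non-stationarity that Remark 1 points out.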

C. Uplink System Model

The reciprocal UL channel is used for pilot-based channel estimation and data transmission; see Fig. 1 and Sections III–IV. Similar to (2), we consider a system model with the received signal $\mathbf{z} \in \mathbb{C}^{N}$ at the BS being

$$\mathbf{z} = \mathbf{h}(d + \eta^{\mathrm{UE}}_t) + \boldsymbol{\eta}^{\mathrm{BS}}_r + \boldsymbol{\nu} \tag{6}$$

where $d \in \mathbb{C}$ is either a deterministic pilot signal (used for channel estimation) or a stochastic data signal; in any case, the average power is $p^{\mathrm{UE}} = \mathbb{E}\{|d|^2\}$. The additive term $\boldsymbol{\nu} = \boldsymbol{\nu}_{\mathrm{noise}} + \boldsymbol{\nu}_{\mathrm{interf}} \in \mathbb{C}^{N \times 1}$ is an ergodic process that consists of independent receiver noise $\boldsymbol{\nu}_{\mathrm{noise}} \sim \mathcal{CN}(\mathbf{0}, \sigma^2 \mathbf{I})$ as well as potential interference $\boldsymbol{\nu}_{\mathrm{interf}}$ from other simultaneous transmissions. The interference is independent of $d$ but might depend on the channel realizations in $\mathcal{H}$. Moreover, the interference statistics can be different in the pilot and data transmission phases; for example, it is common to assume that each cell uses time-division multiple access (TDMA) for pilot transmission, since this can provide sufficient CSI accuracy to enable spatial-division multiple access (SDMA) for data transmission [9]–[13]. Therefore, we assume that $\boldsymbol{\nu}_{\mathrm{interf}}$ has zero mean and that $\mathbf{S} = \mathbb{E}\{\boldsymbol{\nu}_{\mathrm{interf}} \boldsymbol{\nu}^H_{\mathrm{interf}}\}$ is the covariance matrix during pilot transmission. We assume that $\mathbf{S}$ has a uniformly bounded spectral norm, $\|\mathbf{S}\|_2 = \mathcal{O}(1)$, for the same physical reasons as for $\mathbf{R}$. For data transmission, we define the conditional covariance matrix $\mathbf{Q}_{\mathcal{H}} = \mathbb{E}\{\boldsymbol{\nu}_{\mathrm{interf}} \boldsymbol{\nu}^H_{\mathrm{interf}} \mid \mathcal{H}\}$, in a coherence period with channel realizations $\mathcal{H}$, and the corresponding long-term covariance matrix $\mathbb{E}\{\mathbf{Q}_{\mathcal{H}}\}$. The covariance matrices $\mathbf{S}, \mathbf{Q}_{\mathcal{H}} \in \mathbb{C}^{N \times N}$ are positive semi-definite. The spectral norm of $\mathbf{Q}_{\mathcal{H}}$ might grow unboundedly with $N$ due to pilot contamination in multi-cell scenarios [9]–[13]; see Section VI for further details.

Similar to the DL, the aggregate residual transceiver impairments in the hardware used for UL transmission are modeled by the independent distortion noises $\eta^{\mathrm{UE}}_t \in \mathbb{C}$ and $\boldsymbol{\eta}^{\mathrm{BS}}_r \in \mathbb{C}^{N \times 1}$ at the transmitter and receiver, respectively.
These ergodic stochastic processes are independent of $d$, but depend on the channel realizations $\mathcal{H}$. The conditional distributions for a given $\mathcal{H}$ are $\eta^{\mathrm{UE}}_t \sim \mathcal{CN}(0, \upsilon^{\mathrm{UE}}_t)$ and $\boldsymbol{\eta}^{\mathrm{BS}}_r \sim \mathcal{CN}(\mathbf{0}, \boldsymbol{\Upsilon}^{\mathrm{BS}}_r)$. Similar to (3) and (4), the conditional covariance matrices are modeled as

$$\upsilon^{\mathrm{UE}}_t = \kappa^{\mathrm{UE}}_t \, p^{\mathrm{UE}} \tag{7}$$

$$\boldsymbol{\Upsilon}^{\mathrm{BS}}_r = \kappa^{\mathrm{BS}}_r \, p^{\mathrm{UE}} \, \mathrm{diag}(|h_1|^2, \ldots, |h_N|^2). \tag{8}$$

Note that the hardware quality is characterized by $\kappa^{\mathrm{BS}}_t, \kappa^{\mathrm{BS}}_r$ at the BS and by $\kappa^{\mathrm{UE}}_t, \kappa^{\mathrm{UE}}_r$ at the UE. We can have $\kappa^{\mathrm{BS}}_t \neq \kappa^{\mathrm{BS}}_r$ and $\kappa^{\mathrm{UE}}_t \neq \kappa^{\mathrm{UE}}_r$ since different transceiver chains are used for transmission and reception at a device.

Generally speaking, we would like to achieve high performance using cheap hardware. This is particularly evident in massive MIMO systems since the deployment cost of large antenna arrays might scale linearly with $N$ unless we can accept higher levels of impairments, $\kappa^{\mathrm{BS}}_t, \kappa^{\mathrm{BS}}_r$, at the BSs than in conventional systems. This aspect is analyzed in Section V.

III. UPLINK CHANNEL ESTIMATION

This section considers estimation of the current channel realization $\mathbf{h}$ by comparing the received UL signal $\mathbf{z}$ in (6) with the predefined UL pilot signal $d$ (recall: $p^{\mathrm{UE}} = |d|^2$). The classic results on pilot-based channel estimation consider Rayleigh fading channels that are observed in independent complex Gaussian noise with known statistics [32]–[35]. However, this is not the case herein because the distortion noises $\eta^{\mathrm{UE}}_t$ and $\boldsymbol{\eta}^{\mathrm{BS}}_r$ effectively depend on the unknown stochastic channel $\mathbf{h}$. The dependence is either through the multiplication $\mathbf{h}\eta^{\mathrm{UE}}_t$ or the conditional variance of $\boldsymbol{\eta}^{\mathrm{BS}}_r$ in (8), which is essentially the same type of relation. Although the distortion noises are Gaussian when conditioned on a channel realization, the effective distortion is the product of Gaussian variables and, thus, has a complex double Gaussian distribution [36]. Consequently, an optimal channel estimator cannot be deduced from the standard results provided in [32]–[35].

Footnote (to the complex double Gaussian distribution): For example, the $i$th element of $\boldsymbol{\eta}^{\mathrm{BS}}_r$ can be expressed as $x_i = h_i \xi_i$, which is the product of the $i$th channel coefficient $h_i \sim \mathcal{CN}(0, r_{ii})$ and an independent variable $\xi_i \sim \mathcal{CN}(0, \kappa^{\mathrm{BS}}_r p^{\mathrm{UE}})$. The joint product distribution is complex double Gaussian with the PDF $f(x_i) = \frac{2}{\pi \mu_i} K_0\!\left(\frac{2|x_i|}{\sqrt{\mu_i}}\right)$, where $\mu_i = r_{ii} \kappa^{\mathrm{BS}}_r p^{\mathrm{UE}}$ is the variance and $K_0(\cdot)$ denotes the zeroth-order modified Bessel function of the second kind [36].

We now derive the linear minimum mean square error (LMMSE) estimator of $\mathbf{h}$ under hardware impairments.

Theorem 1. The LMMSE estimator of $\mathbf{h}$ from the observation of $\mathbf{z}$ in (6) is

$$\hat{\mathbf{h}} = \underbrace{d^* \mathbf{R} \bar{\mathbf{Z}}^{-1}}_{\triangleq \mathbf{A}} \mathbf{z} \tag{9}$$

where $\mathbf{R}_{\mathrm{diag}} = \mathrm{diag}(r_{11}, \ldots, r_{NN})$ consists of the diagonal elements of $\mathbf{R}$ and the covariance matrix of $\mathbf{z}$ is denoted as

$$\bar{\mathbf{Z}} = \mathbb{E}\{\mathbf{z}\mathbf{z}^H\} = p^{\mathrm{UE}}(1 + \kappa^{\mathrm{UE}}_t)\mathbf{R} + p^{\mathrm{UE}} \kappa^{\mathrm{BS}}_r \mathbf{R}_{\mathrm{diag}} + \mathbf{S} + \sigma^2 \mathbf{I}. \tag{10}$$

The total MSE is $\mathrm{MSE} = \mathbb{E}\{\|\hat{\mathbf{h}} - \mathbf{h}\|_2^2\} = \mathrm{tr}(\mathbf{C})$, where the error covariance matrix is

$$\mathbf{C} = \mathbb{E}\{(\hat{\mathbf{h}} - \mathbf{h})(\hat{\mathbf{h}} - \mathbf{h})^H\} = \mathbf{R} - p^{\mathrm{UE}} \mathbf{R} \bar{\mathbf{Z}}^{-1} \mathbf{R}. \tag{11}$$

Proof: The LMMSE estimator has the form $\hat{\mathbf{h}} = \mathbf{A}\mathbf{z}$ where $\mathbf{A}$ minimizes the MSE. The MSE definition gives

$$\mathrm{MSE} = \mathrm{tr}\left(\mathbf{R} - d\,\mathbf{A}\mathbf{R} - d^* \mathbf{R}\mathbf{A}^H + \mathbf{A}\bar{\mathbf{Z}}\mathbf{A}^H\right) \tag{12}$$

where the expectations that involve $\eta^{\mathrm{UE}}_t, \boldsymbol{\eta}^{\mathrm{BS}}_r$ in $\mathrm{MSE} = \mathbb{E}\{\|\hat{\mathbf{h}} - \mathbf{h}\|_2^2\}$ are computed by first having a fixed value of $\mathbf{h}$ and then averaging over $\mathbf{h}$. The LMMSE estimator in (9) is achieved by differentiation of (12) with respect to $\mathbf{A}$ and equating to zero. This choice of $\mathbf{A}$ minimizes the MSE since the Hessian is always positive definite. The error covariance matrix and the MSE are obtained by plugging (9) into the respective definitions.

Based on Theorem 1, the channel can be decomposed as $\mathbf{h} = \hat{\mathbf{h}} + \boldsymbol{\epsilon}$ where $\hat{\mathbf{h}}$ is the LMMSE estimate in (9) and $\boldsymbol{\epsilon} \in \mathbb{C}^{N \times 1}$ denotes the unknown estimation error. Contrary to conventional estimation with independent Gaussian noise (cf. [32, Chapter 15.8]), $\hat{\mathbf{h}}$ and $\boldsymbol{\epsilon}$ are neither independent nor jointly complex Gaussian, but only uncorrelated and have zero mean. The covariance matrices are $\mathbb{E}\{\hat{\mathbf{h}}\hat{\mathbf{h}}^H\} = \mathbf{R} - \mathbf{C}$ and $\mathbb{E}\{\boldsymbol{\epsilon}\boldsymbol{\epsilon}^H\} = \mathbf{C}$ where $\mathbf{C}$ was given in (11).

We remark that there might exist non-linear estimators that achieve smaller MSEs than the LMMSE estimator in Theorem 1. This stands in contrast to conventional channel estimation with independent Gaussian noise, where the LMMSE estimator is also the MMSE estimator [34]. However, the difference in MSE performance should be small, since the dependent distortion noises are relatively weak.

Corollary 1. Consider the special case of $\mathbf{R} = \lambda \mathbf{I}$ and $\mathbf{S} = \mathbf{0}$. The error covariance matrix in (11) becomes

$$\mathbf{C} = \lambda \left(1 - \frac{p^{\mathrm{UE}} \lambda}{p^{\mathrm{UE}} \lambda (1 + \kappa^{\mathrm{UE}}_t + \kappa^{\mathrm{BS}}_r) + \sigma^2}\right) \mathbf{I}. \tag{13}$$

In the high UL power regime, we have

$$\lim_{p^{\mathrm{UE}} \to \infty} \mathbf{C} = \lambda \left(1 - \frac{1}{1 + \kappa^{\mathrm{UE}}_t + \kappa^{\mathrm{BS}}_r}\right) \mathbf{I}. \tag{14}$$

This corollary brings important insights on the average estimation error per element in $\mathbf{h}$. The error variance is given by the factor in front of the identity matrix in (13). It is independent of the number of antennas $N$, thus letting $N$ grow large neither increases nor decreases the estimation error per element. The estimation error is clearly a decreasing function of the pilot power $p^{\mathrm{UE}} = |d|^2$, but contrary to the ideal hardware case the error variance does not converge to zero as $p^{\mathrm{UE}} \to \infty$. As seen in (14), there is a strictly positive error floor of $\lambda\left(1 - \frac{1}{1 + \kappa^{\mathrm{UE}}_t + \kappa^{\mathrm{BS}}_r}\right)$ due to the transceiver hardware impairments. Thus, perfect estimation accuracy cannot be achieved in practice, not even asymptotically. The error floor is characterized by the sum of the levels of impairments $\kappa^{\mathrm{UE}}_t$ and $\kappa^{\mathrm{BS}}_r$ in the transmitter and receiver hardware, respectively. In terms of estimation accuracy, it is thus equally important to have high-quality hardware at the BS and at the UE.

Non-ideal hardware exhibits an error floor also when $\mathbf{R}$ is non-diagonal and when there is interference such that $\mathbf{S} \neq \mathbf{0}$; the general high-power limit is easily computed from (11). In fact, the results hold for any zero-mean channel and interference distributions with covariance matrices $\mathbf{R}$ and $\mathbf{S}$, because the LMMSE estimator and its MSE are computed using only the first two moments of the statistical distributions [32], [34].

A. Impact of the Pilot Length

The LMMSE estimator in Theorem 1 considers a scalar pilot signal $d$, which is sufficient to excite all $N$ channel dimensions in the UL and is used in Section IV-B to derive lower bounds on the UL and DL capacities. With ideal hardware and a total pilot energy constraint, a scalar pilot signal is also sufficient to minimize the MSE [34]. In contrast, we have non-ideal hardware and per-symbol energy constraints in this paper. In this case we can improve the MSE by increasing the pilot length.

Footnote (to the estimation error per element in Corollary 1): The MSE per element is finite, i.e., $\frac{1}{N}\mathrm{tr}(\mathbf{C}) < \infty$, but the sum MSE behaves as $\mathrm{tr}(\mathbf{C}) \to \infty$ when $N \to \infty$ since the number of elements grows.

Suppose we use a pilot signal $\mathbf{d} \in \mathbb{C}^{1 \times B}$ that spans $1 \leq B \leq T^{\mathrm{UL}}_{\mathrm{pilot}}$ channel uses and where each element of $\mathbf{d}$ has squared norm $p^{\mathrm{UE}}$. A simple estimation approach would be to compute $B$ separate LMMSE estimates, $\hat{\mathbf{h}}_i = \mathbf{h} - \boldsymbol{\epsilon}_i$ for $i = 1, \ldots, B$, using Theorem 1. By averaging, we obtain

$$\widehat{\hat{\mathbf{h}}} = \frac{1}{B} \sum_{i=1}^{B} \hat{\mathbf{h}}_i = \mathbf{h} - \frac{1}{B} \sum_{i=1}^{B} \boldsymbol{\epsilon}_i. \tag{15}$$

If the distortion noises are temporally uncorrelated and identically distributed, the MSE of the estimate $\widehat{\hat{\mathbf{h}}}$ is

$$\mathbb{E}\left\{ \left(\frac{1}{B} \sum_{i=1}^{B} \boldsymbol{\epsilon}_i\right)^{H} \left(\frac{1}{B} \sum_{j=1}^{B} \boldsymbol{\epsilon}_j\right) \right\} = \frac{\mathrm{tr}(\mathbf{C})}{B}. \tag{16}$$

Hence, the MSE goes to zero as $1/B$ when we increase the pilot length $B$, although the MSE per pilot channel use is limited by the non-zero error floor demonstrated in Corollary 1. This is interesting because one pilot signal with energy $Bp^{\mathrm{UE}}$ exhibits a noise floor, while $B$ pilot signals with energy $p^{\mathrm{UE}}$ per signal do not. This stands in contrast to the case of ideal hardware, where the MSE is exactly the same in both cases [34].
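The error floor in Corollary 1 and the $1/B$ scaling in (16) can be checked with a few lines of Python. This is a minimal sketch for the special case $\mathbf{R} = \lambda \mathbf{I}$, $\mathbf{S} = \mathbf{0}$; the function names and all parameter values below are hypothetical and chosen only for illustration, and `averaged_mse` relies on the same assumption as (16), namely temporally uncorrelated and identically distributed distortions.

```python
def error_variance(p_ue, lam, kappa_ue_t, kappa_bs_r, sigma2):
    """Per-element error variance: the scalar factor in (13) (R = lam*I, S = 0)."""
    signal = p_ue * lam
    return lam * (1 - signal / (signal * (1 + kappa_ue_t + kappa_bs_r) + sigma2))

def error_floor(lam, kappa_ue_t, kappa_bs_r):
    """High-power limit of the error variance: the scalar factor in (14)."""
    return lam * (1 - 1 / (1 + kappa_ue_t + kappa_bs_r))

def averaged_mse(mse_single_pilot, num_pilots):
    """MSE after averaging B separate estimates as in (16): tr(C) / B."""
    return mse_single_pilot / num_pilots

# Hypothetical parameter values (not from the paper).
lam, sigma2, n_antennas = 1.0, 1.0, 50
kappa_ue_t = kappa_bs_r = 0.01
for p_ue in (1.0, 100.0, 1e6):
    print(p_ue, error_variance(p_ue, lam, kappa_ue_t, kappa_bs_r, sigma2))
floor = error_floor(lam, kappa_ue_t, kappa_bs_r)
print("floor:", floor)
print("total MSE at the floor, B = 4:", averaged_mse(n_antennas * floor, 4))
```

Increasing `p_ue` moves the error variance towards `floor` but never below it, whereas increasing `num_pilots` keeps reducing the averaged MSE.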
The reason is that we can average out the distortion noise (similar to the law of large numbers) when we have $B$ independent realizations.

Despite the averaging effect, we stress that $B \leq T_{\mathrm{coher}}$ and thus there is always an estimation error floor for non-ideal hardware; we can, at most, reduce the floor by a factor $1/T_{\mathrm{coher}}$ by increasing the pilot length. Moreover, the derivation above is based on having temporally uncorrelated distortions, but the distortions might be temporally correlated in practice (especially if the same pilot signal $d$ is transmitted multiple times through the same hardware). In these cases, the benefit of increasing $B$ is smaller and $\widehat{\hat{\mathbf{h}}}$ should be replaced by an estimator that exploits the temporal correlation by estimating $\mathbf{h}$ jointly from all the $B$ observations. Finally, we note that it is of great interest to find the $B$ that maximizes some measure of system-wide performance, but this is outside the scope of our current paper. We refer to [34], [35], [37], [38] for some relevant works in the case of ideal hardware.

B. Numerical Illustrations
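As a first, purely illustrative check of the averaging argument behind (16), the following sketch draws scalar complex Gaussian errors as a stand-in for one element of $\boldsymbol{\epsilon}_i$ and compares the two extremes discussed in Section III-A. It is a toy Monte Carlo experiment under our own assumptions (unit error variance, the whole error either independent across the $B$ channel uses or identical in all of them), not the simulation setup used for the figures; the function name and defaults are hypothetical.

```python
import random


def averaged_error_variance(B, trials=20000, var=1.0, correlated=False, seed=0):
    """Empirical variance of the average of B zero-mean complex errors."""
    rng = random.Random(seed)
    s = (var / 2.0) ** 0.5  # standard deviation of the real and imaginary parts
    total = 0.0
    for _ in range(trials):
        if correlated:
            # Fully correlated: the same realization repeats in all B channel uses,
            # so averaging returns that realization unchanged.
            avg = complex(rng.gauss(0.0, s), rng.gauss(0.0, s))
        else:
            # Uncorrelated: B independent realizations are averaged.
            acc = 0j
            for _ in range(B):
                acc += complex(rng.gauss(0.0, s), rng.gauss(0.0, s))
            avg = acc / B
        total += abs(avg) ** 2
    return total / trials


if __name__ == "__main__":
    for B in (1, 2, 4, 8):
        print(B, averaged_error_variance(B), averaged_error_variance(B, correlated=True))
```

In the uncorrelated case the empirical variance falls roughly as $1/B$, while in the fully correlated case it stays at its single-shot value.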

This section exemplifies the impact of transceiver hardware impairments on the channel estimation accuracy.

In Fig. 3, we consider $N = 50$ antennas at the BS and no interference (i.e., $\mathbf{S} = \mathbf{0}$). The channel covariance matrix $\mathbf{R}$ is generated by the exponential correlation model from [39], which means that the $(i, j)$th element of $\mathbf{R}$ is

$$[\mathbf{R}]_{i,j} = \begin{cases} \delta \, r^{\,j-i}, & i \leq j, \\ \delta \, (r^{\,i-j})^{*}, & i > j, \end{cases} \tag{17}$$

where $\delta$ is an arbitrary scaling factor. This model basically describes a uniform linear array (ULA) where the correlation factor between adjacent antennas is given by $|r|$ (for $0 \leq |r| \leq 1$) and the phase $\angle r$ describes the angle of arrival/departure as seen from the array. The correlation factor $|r|$ determines the eigenvalue spread in $\mathbf{R}$, while $\angle r$ determines the corresponding eigenvectors. Since we simulate channel estimation without interference, the angle has no impact on the MSE and we can let $r$ be real-valued without loss of generality. We consider a correlation coefficient of r = 0.[…], which is a modest correlation in the sense of behaving similarly to an array with half-wavelength antenna spacings and a large angular spread of 45 degrees (cf. [40, Fig. 1] which shows how practical angular spreads map non-linearly to $|r|$).

Footnote: Since we have per-symbol energy constraints, what we really compare is one system that has an average symbol energy of $B p^{\mathrm{UE}}$ and one with $p^{\mathrm{UE}}$.

Fig. 3 shows the relative estimation error per channel element, $\mathrm{MSE}_{\mathrm{rel}} = \frac{\mathrm{MSE}}{\mathrm{tr}(\mathbf{R})}$, as a function of the average SNR in the UL, defined as

$$\mathrm{SNR}^{\mathrm{UL}} = p^{\mathrm{UE}} \frac{\mathrm{tr}(\mathbf{R})}{N \sigma^2}. \tag{18}$$

Based on the typical EVM ranges described in Remark 1, we consider four hardware setups with different levels of impairments: κ_t^UE = κ_r^BS ∈ {0, 0.[…], 0.[…], 0.[…]}. We compare the LMMSE estimator in Theorem 1 with the conventional impairment-ignoring MMSE estimator from [32]–[34]. Fig. 3 confirms that there are non-zero error floors at high SNRs, as proved by Corollary 1 and the subsequent discussion. The error floor increases with the levels of impairments.
The estimation error is very close to the floor when the uplink SNR reaches 20–30 dB; thus, further increase in SNR only brings minor improvement. This tells us that we need an uplink SNR of at least 20 dB to fully utilize massive MIMO, because coherent transmission/reception requires accurate CSI. Lower SNRs can be compensated by adding extra antennas (see Fig. 6 in Section IV), but the practical performance gain is then not as large.

Moreover, Fig. 3 shows that the conventional impairment-ignoring estimator is only slightly worse than the proposed LMMSE estimator. This indicates that although hardware impairments greatly affect the estimation performance, they only bring minor changes to the structure of the optimal estimator.

The influence of the estimation error floors depends on the anticipated spectral efficiency, the uplink SNR, and the number of antennas. To gain some insight, suppose we have ideal hardware and that the fraction of channel uses allocated for UL data transmission is $T_{\mathrm{data}}^{\mathrm{UL}} / T_{\mathrm{coher}} = 0.45$. The uplink spectral efficiency can then be approximated as

$$0.45 \log_2 \left( 1 + \frac{1 - \mathrm{MSE}_{\mathrm{rel}}}{\mathrm{MSE}_{\mathrm{rel}} + \frac{1}{N \, \mathrm{SNR}^{\mathrm{UL}}}} \right) \tag{19}$$

by using [41, Lemma 1]. When the number of antennas is large, such that $N \, \mathrm{SNR}^{\mathrm{UL}} \to \infty$, this approximation gives a spectral efficiency of 1.5 [bit/channel use] for $\mathrm{MSE}_{\mathrm{rel}} = 10^{-1}$ and 4.5 [bit/channel use] for $\mathrm{MSE}_{\mathrm{rel}} = 10^{-3}$. The impact of the estimation errors on systems with non-ideal hardware is considered in Section IV, where we derive lower and upper capacity bounds and analyze these for different SNRs and numbers of antennas.

Footnote: Note that the MSE of any linear estimator $\tilde{\mathbf{A}} \mathbf{z}$ can be computed by plugging the matrix $\tilde{\mathbf{A}}$ into the general MSE expression in (12). The difference in MSE is easily quantified by comparing with $\mathrm{tr}(\mathbf{C})$ using (11).

Fig. 3: Estimation error per antenna element for the LMMSE estimator in Theorem 1 and the conventional impairment-ignoring MMSE estimator. Transceiver hardware impairments create non-zero error floors. (Axes: average SNR [dB] versus relative estimation error per antenna. Curves: conventional impairment-ignoring estimator, LMMSE estimator, and error floors, for each level of impairments.)

Next, we illustrate the possible improvement in estimation accuracy by increasing the pilot length to comprise $B \geq 1$ channel uses. As discussed in Section III-A, it is not clear whether the distortion noise is temporally uncorrelated or correlated in practice. Therefore, we fix the levels of impairments at κ_t^UE = κ_r^BS = 0.[…] and consider the two extremes: temporally uncorrelated and fully correlated distortion noises. The latter means that the distortion noise realizations are the same for all $B$ channel uses, since the same pilot signal is always distorted in the same way. The channel and interference statistics are as in the previous figure (i.e., $N = 50$, $\mathbf{S} = \mathbf{0}$, and $\mathbf{R}$ is given by the exponential correlation model with r = 0.[…]).

The relative estimation error per antenna element is shown in Fig. 4 as a function of the pilot length. We also show the performance with ideal hardware as a reference.
At a low SNR of 5 dB, hardware impairments have little impact and there is a small but clear gain from increasing the pilot length because the total pilot energy increases as $B p^{\mathrm{UE}}$. At a high SNR of 30 dB, the temporal correlation has a large impact. Only small improvements are possible in the fully correlated case, since only the receiver noise can be mitigated by increasing $B$. In the uncorrelated case, the distortion noise can also be mitigated by increasing $B$. This gives a logarithmic slope similar to the case of ideal hardware. We stress that the actual performance lies somewhere in between the two extremes.

Fig. 4: Estimation error per antenna element for the LMMSE estimator in Theorem 1 as a function of the pilot length $B$. The levels of impairments are κ_t^UE = κ_r^BS = 0.[…] and different temporal correlations are considered. (Axes: pilot length ($B$) versus relative estimation error per antenna. Curves: fully-correlated distortion noise, uncorrelated distortion noise, ideal hardware.)

Next, we consider different channel covariance models:
1) Uncorrelated antennas $\mathbf{R} = \mathbf{I}$. (Equivalent to the exponential correlation model in (17) with $r = 0$.)
2) Exponential correlation model with r = 0.[…].
3) One-ring model with 20 degrees angular spread [42].
4) One-ring model with 10 degrees angular spread [42].

The exponential correlation model was defined in (17). The classic one-ring model assumes a ring of scatterers around the UE, while there is no scattering close to the BS [42]. From the BS perspective, the multipath components arrive from a main angle of arrival (here: […] degrees) and a small angular spread around it (here: 10 or 20 degrees). The BS is assumed to have a ULA with half-wavelength antenna spacings. An important property of this model is that $\mathbf{R}$ might not have full rank as $N$ grows large [43], [44], due to insufficient scattering.

The relative estimation error per channel element is shown in Fig. 5 for these four channel covariance models. We consider two SNRs (5 and 30 dB), hardware impairments with κ_t^UE = κ_r^BS = 0.[…], and show the estimation errors as a function of the number of BS antennas. The main observation from Fig. 5 is that the choice of covariance model has a large impact on the estimation accuracy. It was proved in [34] that spatially correlated channels are easier to estimate and this is consistent with our results; increasing the coefficient $r$ in the exponential correlation model and decreasing the angular spread in the one-ring model lead to higher spatial correlation and smaller errors in Fig. 5. However, the error floors due to hardware impairments make the difference between the models reduce with the SNR. Moreover, the estimation error per antenna is virtually independent of $N$ in the exponential correlation model, while increasing $N$ improves the error per antenna in the one-ring model. This is explained by the limited richness of the propagation environment in the one-ring model, which is a physical property that we can expect in practice.

Remark 2 (Acquiring Large Covariance Matrices). The proposed channel estimator requires knowledge of the $N \times N$ covariance matrices $\mathbf{R}$ and $\mathbf{S}$.
It becomes increasingly difficult to acquire consistent estimates of covariance matrices as their dimensions grow large [45]. Fortunately, the channel statistics have a much larger coherence time and coherence bandwidth than the channel realization itself; thus, one can obtain many more observations in the covariance estimation than in channel vector estimation. Robust covariance estimators for large matrices were recently considered in [46]. The impact of imperfect covariance information on the channel estimation accuracy was analyzed in [47]. The authors observe that the usual improvement in MSE from having spatial correlation vanishes if the covariance information cannot be trusted, but the MSE degradation is otherwise small (if the estimated covariance matrices are robustified). Another problem is that the large-dimensional matrix inversion in (9) is very computationally expensive, but [47] proposed low-complexity approximations based on polynomial expansions.

Instead of acquiring the covariance matrix of a user directly, the coverage area can be divided into "location bins" with approximately the same channel statistics within each bin [48]. By acquiring and storing the covariance matrices for each bin in advance, it is sufficient to estimate the location of a user and then associate the user with the corresponding bin.

Fig. 5: Estimation error per antenna element for the LMMSE estimator in Theorem 1 as a function of the number of BS antennas. Four different channel covariance models are considered and κ_t^UE = κ_r^BS = 0.[…]. (Axes: number of base station antennas ($N$) versus relative estimation error per antenna. Curves: Case 1: uncorrelated; Case 2: exponential model; Case 3: one-ring, 20 degrees; Case 4: one-ring, 10 degrees.)

IV. DOWNLINK AND UPLINK DATA TRANSMISSION

This section analyzes the ergodic channel capacities of the downlink in (2) and the uplink in (6), under the fixed TDD protocol depicted in Fig. 2. More precisely, we derive upper and lower capacity bounds that reveal the fundamental impact of non-ideal hardware. These bounds are based on having perfect CSI (i.e., exact knowledge of $\mathbf{h}$) and imperfect pilot-based CSI estimation (using the LMMSE estimator in Theorem 1), respectively. Since these are two extremes, the capacity bounds hold when using the channel estimation technique proposed in Section III and for any better CSI acquisition technique that can be derived in the future. We now define the DL and UL capacities for arbitrary CSI quality at the BS and UE.

We consider the ergodic capacity (in bit/channel use) of the memoryless DL system in (2). In each coherence period, the BS has some arbitrary imperfect knowledge $\mathcal{H}^{\mathrm{BS}}$ of the current channel states $\mathcal{H}$ and uses it to select the conditional distribution $f(\mathbf{s} \,|\, \mathcal{H}^{\mathrm{BS}})$ of the data signal $\mathbf{s}$. The UE has a separate arbitrary imperfect knowledge $\mathcal{H}^{\mathrm{UE}}$ of the current channel states $\mathcal{H}$ and uses it to decode the data. Based on the well-known capacity expressions in [49], the ergodic DL capacity is

$$C^{\mathrm{DL}} = \frac{T_{\mathrm{data}}^{\mathrm{DL}}}{T_{\mathrm{coher}}} \, \mathbb{E} \left\{ \max_{f(\mathbf{s} | \mathcal{H}^{\mathrm{BS}}) \,:\, \mathbb{E}\{ \| \mathbf{s} \|_2^2 \} \leq p^{\mathrm{BS}}} I(\mathbf{s}; y \,|\, \mathcal{H}, \mathcal{H}^{\mathrm{BS}}, \mathcal{H}^{\mathrm{UE}}) \right\} \tag{20}$$

where $I(\mathbf{s}; y \,|\, \mathcal{H}, \mathcal{H}^{\mathrm{BS}}, \mathcal{H}^{\mathrm{UE}})$ denotes the conditional mutual information between the received signal $y$ and data signal $\mathbf{s}$ for a given channel realization $\mathcal{H}$ and given channel knowledge $\mathcal{H}^{\mathrm{BS}}$ and $\mathcal{H}^{\mathrm{UE}}$. The expectation in (20) is taken over the joint distribution of $\mathcal{H}$, $\mathcal{H}^{\mathrm{BS}}$, and $\mathcal{H}^{\mathrm{UE}}$.
Note that the factor $T_{\mathrm{data}}^{\mathrm{DL}} / T_{\mathrm{coher}}$ is the fixed fraction of channel uses allocated for DL data transmission.

In addition, the ergodic capacity (in bit/channel use) of the memoryless fading UL system in (6) is

$$C^{\mathrm{UL}} = \frac{T_{\mathrm{data}}^{\mathrm{UL}}}{T_{\mathrm{coher}}} \, \mathbb{E} \left\{ \max_{f(d | \mathcal{H}^{\mathrm{UE}}) \,:\, \mathbb{E}\{ |d|^2 \} \leq p^{\mathrm{UE}}} I(d; \mathbf{z} \,|\, \mathcal{H}, \mathcal{H}^{\mathrm{BS}}, \mathcal{H}^{\mathrm{UE}}) \right\} \tag{21}$$

where $I(d; \mathbf{z} \,|\, \mathcal{H}, \mathcal{H}^{\mathrm{BS}}, \mathcal{H}^{\mathrm{UE}})$ denotes the conditional mutual information between the received signal $\mathbf{z}$ and data signal $d$ for a given channel realization $\mathcal{H}$ and given channel knowledge $\mathcal{H}^{\mathrm{BS}}$ and $\mathcal{H}^{\mathrm{UE}}$. The conditional probability distribution of the data signal is denoted $f(d \,|\, \mathcal{H}^{\mathrm{UE}})$ and the expectation in (21) is taken over the joint distribution of $\mathcal{H}$, $\mathcal{H}^{\mathrm{BS}}$, and $\mathcal{H}^{\mathrm{UE}}$. The fraction of channel uses allocated for UL data transmission is $T_{\mathrm{data}}^{\mathrm{UL}} / T_{\mathrm{coher}}$.

There are a few implicit properties in the capacity definitions. Firstly, the interference variance $I_{\mathcal{H}}^{\mathrm{UE}}$ and covariance matrix $\mathbf{Q}_{\mathcal{H}}$ depend on the channel realizations $\mathcal{H}$ and change between coherence periods. We are not limiting the analysis to any specific interference models but take care of it in the capacity bounds; the lower bounds treat the interference as Gaussian noise, while the upper bounds assume perfect interference suppression. Section VI describes the interference in multi-cell scenarios in detail. Secondly, we assume that the distortion noises are temporally independent, which is a good model when the data signals are also temporally independent.

The next subsections study the capacity behavior in the limit of infinitely many BS antennas ($N \to \infty$), which brings insights on how hardware impairments affect channels with large antenna arrays. The DL and UL are analyzed side-by-side since the results follow from similar derivations.

A. Upper Bounds on Channel Capacities

Upper bounds on the capacities in (20) and (21) can be obtained by adding extra channel knowledge and removing all interference (i.e., $I_{\mathcal{H}}^{\mathrm{UE}} = 0$ and $\mathbf{Q}_{\mathcal{H}} = \mathbf{0}$). We assume that the UL/DL pilot signals provide the BS and UE with perfect channel knowledge in each coherence period: $\mathcal{H}^{\mathrm{BS}} = \mathcal{H}^{\mathrm{UE}} = \mathcal{H}$. Since the receiver noise and distortion noises in (2) and (6) are circularly symmetric complex Gaussian distributed and independent of the useful signals under perfect CSI, we deduce that Gaussian signaling is optimal in the DL and UL [2] and that single-stream transmission with $\mathrm{rank}(\mathbf{W}) = 1$ is sufficient to achieve optimality [6]; that is, we can set $\mathbf{s} = \mathbf{w} s$ for $s \sim \mathcal{CN}(0, p^{\mathrm{BS}})$ and some unit-norm beamforming vector $\mathbf{w}$ in the DL and $d \sim \mathcal{CN}(0, p^{\mathrm{UE}})$ in the UL. This gives us the following initial upper bounds.

Lemma 1.

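The downlink and uplink capacities in (20) and (21), respectively, are bounded as

$$C^{\mathrm{DL}} \leq \frac{T_{\mathrm{data}}^{\mathrm{DL}}}{T_{\mathrm{coher}}} \, \mathbb{E} \left\{ \log_2 \left( 1 + \mathbf{h}^H \left( \kappa_t^{\mathrm{BS}} \mathbf{D}_{|\mathbf{h}|^2} + \kappa_r^{\mathrm{UE}} \mathbf{h} \mathbf{h}^H + \frac{\sigma^2}{p^{\mathrm{BS}}} \mathbf{I} \right)^{-1} \mathbf{h} \right) \right\} \tag{22}$$

$$C^{\mathrm{UL}} \leq \frac{T_{\mathrm{data}}^{\mathrm{UL}}}{T_{\mathrm{coher}}} \, \mathbb{E} \left\{ \log_2 \left( 1 + \mathbf{h}^H \left( \kappa_t^{\mathrm{UE}} \mathbf{h} \mathbf{h}^H + \kappa_r^{\mathrm{BS}} \mathbf{D}_{|\mathbf{h}|^2} + \frac{\sigma^2}{p^{\mathrm{UE}}} \mathbf{I} \right)^{-1} \mathbf{h} \right) \right\} \tag{23}$$

where $\mathbf{D}_{|\mathbf{h}|^2} = \mathrm{diag}(|h_1|^2, \ldots, |h_N|^2)$ with $\mathbf{h} = [h_1 \, \ldots \, h_N]^T$. These upper bounds are achieved with equality under perfect CSI, using the beamforming vector

$$\mathbf{w}_{\mathrm{upper}}^{\mathrm{DL}} = \frac{\left( \kappa_t^{\mathrm{BS}} \mathbf{D}_{|\mathbf{h}|^2} + \frac{\sigma^2}{p^{\mathrm{BS}}} \mathbf{I} \right)^{-1} \mathbf{h}^{*}}{\left\| \left( \kappa_t^{\mathrm{BS}} \mathbf{D}_{|\mathbf{h}|^2} + \frac{\sigma^2}{p^{\mathrm{BS}}} \mathbf{I} \right)^{-1} \mathbf{h}^{*} \right\|_2} \tag{24}$$

in the downlink and by applying a receive combining vector

$$\mathbf{w}_{\mathrm{upper}}^{\mathrm{UL}} = \frac{\left( \kappa_r^{\mathrm{BS}} \mathbf{D}_{|\mathbf{h}|^2} + \frac{\sigma^2}{p^{\mathrm{UE}}} \mathbf{I} \right)^{-1} \mathbf{h}}{\left\| \left( \kappa_r^{\mathrm{BS}} \mathbf{D}_{|\mathbf{h}|^2} + \frac{\sigma^2}{p^{\mathrm{UE}}} \mathbf{I} \right)^{-1} \mathbf{h} \right\|_2} \tag{25}$$

in the uplink.

Proof: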

The proof is given in Appendix C-A.

Note that the beamforming vector in (24) and receive combining vector in (25) only depend on the channel vector $\mathbf{h}$, hardware impairments at the BS, and the receiver noise. Hardware impairments at the UE have no impact on $\mathbf{w}_{\mathrm{upper}}^{\mathrm{DL}}$ and $\mathbf{w}_{\mathrm{upper}}^{\mathrm{UL}}$ since their distortion noise essentially acts as an interferer with the same channel as the data signal; thus filtering cannot reduce it.

The bounds in Lemma 1 are not amenable to simple analysis, but the lemma enables us to derive further bounds on the channel capacities that are expressed in closed form.
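One way to see why the UE distortion cannot be filtered out, and why the quadratic forms in (22) and (23) saturate, is the Sherman-Morrison identity. The short derivation below is an illustrative sketch and not necessarily the route taken in the appendix; the shorthand $\mathbf{B}$, $X$, and $\kappa$ is introduced only here, with $\kappa = \kappa_r^{\mathrm{UE}}$ in the DL, and the diagonal matrix is assumed to contain the squared magnitudes $|h_i|^2$.

```latex
\begin{align*}
\mathbf{B} &\triangleq \kappa_t^{\mathrm{BS}} \mathbf{D}_{|\mathbf{h}|^2} + \frac{\sigma^2}{p^{\mathrm{BS}}} \mathbf{I},
\qquad
X \triangleq \mathbf{h}^H \mathbf{B}^{-1} \mathbf{h}
  = \sum_{i=1}^{N} \frac{|h_i|^2}{\kappa_t^{\mathrm{BS}} |h_i|^2 + \sigma^2 / p^{\mathrm{BS}}}, \\
\left( \mathbf{B} + \kappa \mathbf{h} \mathbf{h}^H \right)^{-1}
  &= \mathbf{B}^{-1} - \frac{\kappa \, \mathbf{B}^{-1} \mathbf{h} \mathbf{h}^H \mathbf{B}^{-1}}{1 + \kappa X}, \\
\mathbf{h}^H \left( \mathbf{B} + \kappa \mathbf{h} \mathbf{h}^H \right)^{-1} \mathbf{h}
  &= X - \frac{\kappa X^2}{1 + \kappa X}
   = \frac{X}{1 + \kappa X}
   < \frac{1}{\kappa} \quad (\kappa > 0).
\end{align*}
```

Under these assumptions, the argument of the logarithm in (22) equals $1 + X / (1 + \kappa_r^{\mathrm{UE}} X)$, which stays below $1 + 1/\kappa_r^{\mathrm{UE}}$ however large $X$ becomes. The UL expression in (23) follows in the same way with $\kappa_r^{\mathrm{BS}}$, $p^{\mathrm{UE}}$, and $\kappa = \kappa_t^{\mathrm{UE}}$.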

Theorem 2.

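The downlink and uplink capacities in (20) and (21), respectively, are bounded as

$$C^{\mathrm{DL}} \leq C_{\mathrm{upper}}^{\mathrm{DL}} = \frac{T_{\mathrm{data}}^{\mathrm{DL}}}{T_{\mathrm{coher}}} \log_2 \left( 1 + \frac{G^{\mathrm{DL}}}{1 + \kappa_r^{\mathrm{UE}} G^{\mathrm{DL}}} \right) \tag{26}$$

$$C^{\mathrm{UL}} \leq C_{\mathrm{upper}}^{\mathrm{UL}} = \frac{T_{\mathrm{data}}^{\mathrm{UL}}}{T_{\mathrm{coher}}} \log_2 \left( 1 + \frac{G^{\mathrm{UL}}}{1 + \kappa_t^{\mathrm{UE}} G^{\mathrm{UL}}} \right) \tag{27}$$

where $r_{11}, \ldots, r_{NN}$ are the diagonal elements of $\mathbf{R}$,

$$G^{\mathrm{DL}} = \sum_{i=1}^{N} \left( \frac{1}{\kappa_t^{\mathrm{BS}}} - \frac{\sigma^2 \, e^{\frac{\sigma^2}{p^{\mathrm{BS}} \kappa_t^{\mathrm{BS}} r_{ii}}}}{p^{\mathrm{BS}} (\kappa_t^{\mathrm{BS}})^2 r_{ii}} \, E_1 \left( \frac{\sigma^2}{p^{\mathrm{BS}} \kappa_t^{\mathrm{BS}} r_{ii}} \right) \right), \tag{28}$$

$$G^{\mathrm{UL}} = \sum_{i=1}^{N} \left( \frac{1}{\kappa_r^{\mathrm{BS}}} - \frac{\sigma^2 \, e^{\frac{\sigma^2}{p^{\mathrm{UE}} \kappa_r^{\mathrm{BS}} r_{ii}}}}{p^{\mathrm{UE}} (\kappa_r^{\mathrm{BS}})^2 r_{ii}} \, E_1 \left( \frac{\sigma^2}{p^{\mathrm{UE}} \kappa_r^{\mathrm{BS}} r_{ii}} \right) \right), \tag{29}$$

and $E_1(x) = \int_1^{\infty} \frac{e^{-tx}}{t} \, dt$ denotes the exponential integral.

Proof: The proof is given in Appendix C-B.

Footnote: A receive combining vector $\mathbf{w}$ is a linear filter $\mathbf{w}^H \mathbf{z}$ that transforms the system into an effective single-input single-output (SISO) system.

These closed-form upper bounds provide important insights on the achievable DL and UL performance under transceiver hardware impairments. In particular, the following two corollaries provide some ultimate capacity limits in the asymptotic regimes of many BS antennas or large transmit powers.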
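The closed-form bounds in Theorem 2 are straightforward to evaluate numerically. The following minimal sketch does so with the Python standard library only; it assumes that each per-antenna term of (28) and (29) has the form $\frac{1}{\kappa}(1 - x e^{x} E_1(x))$ with $x = \sigma^2 / (p \kappa r_{ii})$, which requires a strictly positive BS impairment level. The helper names, the default `data_fraction`, and the way $e^{x} E_1(x)$ is computed (power series for small $x$, continued fraction otherwise) are our own choices for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329


def scaled_exp_integral(x):
    """Return exp(x) * E1(x) for x > 0, where E1 is the exponential integral."""
    if x <= 0.0:
        raise ValueError("x must be positive")
    if x <= 1.0:
        # Power series: E1(x) = -gamma - ln(x) - sum_{k>=1} (-x)^k / (k * k!)
        total, term = 0.0, 1.0
        for k in range(1, 60):
            term *= -x / k  # term now equals (-x)^k / k!
            total -= term / k
        return math.exp(x) * (-EULER_GAMMA - math.log(x) + total)
    # Continued fraction (modified Lentz), evaluated without the exp(-x) factor.
    b = x + 1.0
    c = 1.0 / 1e-300
    d = 1.0 / b
    h = d
    for i in range(1, 10000):
        a = -float(i * i)
        b += 2.0
        d = 1.0 / (a * d + b)
        c = b + a / c
        delta = c * d
        h *= delta
        if abs(delta - 1.0) < 1e-14:
            break
    return h


def upper_bound_gain(r_diag, p, kappa_bs, sigma2=1.0):
    """Evaluate G in (28)/(29) for the diagonal elements r_diag of R (kappa_bs > 0)."""
    total = 0.0
    for r_ii in r_diag:
        x = sigma2 / (p * kappa_bs * r_ii)
        total += (1.0 - x * scaled_exp_integral(x)) / kappa_bs
    return total


def capacity_upper_bound(r_diag, p, kappa_bs, kappa_ue, data_fraction=0.45, sigma2=1.0):
    """Evaluate the bound (26)/(27); data_fraction is an illustrative default."""
    g = upper_bound_gain(r_diag, p, kappa_bs, sigma2)
    return data_fraction * math.log2(1.0 + g / (1.0 + kappa_ue * g))
```

For example, `capacity_upper_bound([1.0] * 100, p=100.0, kappa_bs=0.01, kappa_ue=0.01)` evaluates the bound for 100 uncorrelated antennas at 20 dB SNR with hypothetical impairment levels.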

Corollary 2.

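The downlink upper capacity bound in (26) has the following asymptotic properties:

$$\lim_{p^{\mathrm{BS}} \to \infty} C_{\mathrm{upper}}^{\mathrm{DL}} = \frac{T_{\mathrm{data}}^{\mathrm{DL}}}{T_{\mathrm{coher}}} \log_2 \left( 1 + \frac{N}{\kappa_t^{\mathrm{BS}} + \kappa_r^{\mathrm{UE}} N} \right) \tag{30}$$

$$\lim_{N \to \infty} C_{\mathrm{upper}}^{\mathrm{DL}} = \frac{T_{\mathrm{data}}^{\mathrm{DL}}}{T_{\mathrm{coher}}} \log_2 \left( 1 + \frac{1}{\kappa_r^{\mathrm{UE}}} \right). \tag{31}$$

Proof: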

The diagonal elements of $\mathbf{R}$ satisfy $r_{ii} > 0$ $\forall i$, by definition; thus $G^{\mathrm{DL}} \to \sum_{i=1}^{N} \frac{1}{\kappa_t^{\mathrm{BS}}} = \frac{N}{\kappa_t^{\mathrm{BS}}}$ as $p^{\mathrm{BS}} \to \infty$ for fixed $N$, giving (30). The positive diagonal elements also imply $\frac{1}{N} G^{\mathrm{DL}} > 0$ as $N \to \infty$, thus $\frac{G^{\mathrm{DL}}}{1 + \kappa_r^{\mathrm{UE}} G^{\mathrm{DL}}} - \frac{G^{\mathrm{DL}}}{\kappa_r^{\mathrm{UE}} G^{\mathrm{DL}}} \to 0$ as $N \to \infty$, which turns (26) into (31).

This corollary shows that the DL capacity has finite ceilings when either the DL transmit power $p^{\mathrm{BS}}$ or the number of BS antennas $N$ grows large. The ceilings depend on the impairment parameters $\kappa_t^{\mathrm{BS}}$ and $\kappa_r^{\mathrm{UE}}$, but the UE impairments are clearly $N$ times more influential. Note that even very small hardware impairments will ultimately limit the capacity. In other words, the ever-increasing capacity observed in the high-SNR and large-$N$ regimes with ideal transceiver hardware (cf. [9]–[13]) is not easily achieved in practice.

The next corollary provides analogous results for the UL.

Corollary 3.

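The uplink upper capacity bound in (27) has the following asymptotic properties:

$$\lim_{p^{\mathrm{UE}} \to \infty} C_{\mathrm{upper}}^{\mathrm{UL}} = \frac{T_{\mathrm{data}}^{\mathrm{UL}}}{T_{\mathrm{coher}}} \log_2 \left( 1 + \frac{N}{\kappa_r^{\mathrm{BS}} + \kappa_t^{\mathrm{UE}} N} \right) \tag{32}$$

$$\lim_{N \to \infty} C_{\mathrm{upper}}^{\mathrm{UL}} = \frac{T_{\mathrm{data}}^{\mathrm{UL}}}{T_{\mathrm{coher}}} \log_2 \left( 1 + \frac{1}{\kappa_t^{\mathrm{UE}}} \right). \tag{33}$$

Proof: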

This is proved analogously to Corollary 2.

As seen from Corollary 3, the UL capacity also has finite ceilings when either the UL transmit power $p^{\mathrm{UE}}$ or the number of antennas $N$ grows large. Analogous to the DL, the UE impairments are $N$ times more influential than the BS impairments and thus dominate as $N \to \infty$.

The upper bounds in Corollaries 2 and 3 show that the DL and UL capacities are fundamentally limited by the transceiver hardware impairments. To be certain of the cause of these limits, we also need lower bounds on the channel capacities.

B. Lower Bounds on Channel Capacities

We obtain lower capacity bounds by making the potentially limiting assumptions of Gaussian codebooks, treating interference as Gaussian noise, using linear single-stream beamforming in the DL, using linear receive combining in the UL, pilot-based channel estimation as in Theorem 1, and the entropy-maximizing Gaussian distribution on the CSI uncertainty at the receiver of the DL and UL. The resulting lower bound is given in the following theorem.

Theorem 3.

Let $\tilde{\mathcal{H}}^{\mathrm{UE}}$ and $\tilde{\mathcal{H}}^{\mathrm{BS}}$ denote the CSI available in the decoding at the receiver in the downlink and uplink, respectively. These are degraded as compared to $\mathcal{H}^{\mathrm{UE}}$ and $\mathcal{H}^{\mathrm{BS}}$, or equal. The downlink and uplink capacities in (20) and (21), respectively, are then bounded as

$$C^{\mathrm{DL}} \geq C_{\mathrm{lower}}^{\mathrm{DL}} = \frac{T_{\mathrm{data}}^{\mathrm{DL}}}{T_{\mathrm{coher}}} \, \mathbb{E} \left\{ \log_2 \left( 1 + \mathrm{SINR}_{\mathrm{lower}}^{\mathrm{DL}}(\mathbf{v}^{\mathrm{DL}}) \right) \right\} \tag{34}$$

$$C^{\mathrm{UL}} \geq C_{\mathrm{lower}}^{\mathrm{UL}} = \frac{T_{\mathrm{data}}^{\mathrm{UL}}}{T_{\mathrm{coher}}} \, \mathbb{E} \left\{ \log_2 \left( 1 + \mathrm{SINR}_{\mathrm{lower}}^{\mathrm{UL}}(\mathbf{v}^{\mathrm{UL}}) \right) \right\} \tag{35}$$

where the beamforming vector $\mathbf{v}^{\mathrm{DL}} = [v_1^{\mathrm{DL}} \, \ldots \, v_N^{\mathrm{DL}}]^T$ and the receive combining vector $\mathbf{v}^{\mathrm{UL}} = [v_1^{\mathrm{UL}} \, \ldots \, v_N^{\mathrm{UL}}]^T$ are functions of $\hat{\mathbf{h}}$ and have unit norms. The expectations are taken over $\tilde{\mathcal{H}}^{\mathrm{UE}}$ and $\tilde{\mathcal{H}}^{\mathrm{BS}}$, while the SINR expressions for DL and UL are given in (36) and (37), respectively.

Proof: This theorem is obtained by taking lower bounds on the mutual information in the same way as was previously proposed in [5] and [41]. This bounding technique was applied to massive MIMO systems with ideal hardware in [11]–[13] (among others), by making the limiting assumptions listed in the beginning of this subsection. The distortion noises from non-ideal hardware act as additional noise sources with spatially correlated covariance matrices, thus these can easily be incorporated into the proofs used in previous works.

This theorem is the key to the lower capacity bounding in this paper. The lower bounds in (34) and (35) can be computed numerically for any channel distribution and any way of selecting the beamforming vector (in the DL) and receiver combining vector (in the UL) from the channel estimate $\hat{\mathbf{h}}$, provided that the conditional distribution of $(\mathbf{h}, \hat{\mathbf{h}})$ given $\tilde{\mathcal{H}}$ can be characterized. To bring explicit insights on the behavior when the number of antennas, $N$, grows large, we have the following result for the cases of (approximate) maximum ratio transmission (MRT) in the DL and (approximate) maximum ratio combining (MRC) in the UL.

Theorem 4.

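Assume that no instantaneous CSI is utilized for decoding (i.e., $\tilde{\mathcal{H}}^{\mathrm{BS}} = \tilde{\mathcal{H}}^{\mathrm{UE}} = \emptyset$). For $\mathbf{v} = \frac{\hat{\mathbf{h}}}{\| \hat{\mathbf{h}} \|_2}$ the terms in (36) and (37) behave as

$$\left| \mathbb{E} \{ \mathbf{h}^H \mathbf{v} \} \right|^2 = | \mathbb{E} \{ \varphi \} |^2 \, \mathrm{tr}(\mathbf{R} - \mathbf{C}) + O(\sqrt{N}) \tag{38}$$

$$\mathbb{E} \left\{ | \mathbf{h}^H \mathbf{v} |^2 \right\} = \mathbb{E} \left\{ | \varphi |^2 \right\} \mathrm{tr}(\mathbf{R} - \mathbf{C}) + O(\sqrt{N}) \tag{39}$$

$$\sum_{i=1}^{N} \mathbb{E} \{ |h_i|^2 |v_i|^2 \} = O(1) \tag{40}$$

where

$$\varphi = \frac{(1 + d^{-1} \eta_t^{\mathrm{UE}}) \sqrt{\mathrm{tr}(\mathbf{R} - \mathbf{C})}}{\sqrt{\mathrm{tr} \left( \mathbf{A} ( | d + \eta_t^{\mathrm{UE}} |^2 \mathbf{R} + \boldsymbol{\Psi} ) \mathbf{A}^H \right)}} \tag{41}$$

is a function of the stochastic variable $\eta_t^{\mathrm{UE}}$ while $\mathbf{A} = d^{*} \mathbf{R} \bar{\mathbf{Z}}^{-1}$ and $\boldsymbol{\Psi} = p^{\mathrm{UE}} \kappa_r^{\mathrm{BS}} \mathbf{R}_{\mathrm{diag}} + \mathbf{S} + \sigma^2 \mathbf{I}$ are deterministic matrices.

Proof: The proof is given in Appendix C-C.

Similar asymptotic behaviors were derived in [11]–[13] for the case of ideal hardware. In the general case with hardware impairments, the expectations of $\varphi$ and $|\varphi|^2$ must be computed numerically, because the randomness of the scalar distortion noise $\eta_t^{\mathrm{UE}}$ at the UE remains even when $N$ grows large. In the special case of $\kappa_t^{\mathrm{UE}} = 0$ (which implies $\eta_t^{\mathrm{UE}} = 0$), (38) and (39) both reduce to $\mathrm{tr}(\mathbf{R} - \mathbf{C}) + O(\sqrt{N})$. For $\kappa_t^{\mathrm{UE}} > 0$, the terms in Theorem 4 are easy to compute numerically.

Footnote: The linear processing assumption is motivated by its asymptotic optimality as $N \to \infty$ [9].

Footnote: Finding such a characterization is a challenging task, except for the case $\tilde{\mathcal{H}}^{\mathrm{BS}} = \tilde{\mathcal{H}}^{\mathrm{UE}} = \emptyset$ considered in Theorem 4.

Based on this result, we provide now an asymptotic characterization of the downlink capacity.

Corollary 4.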

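Consider the DL with beamforming vector $\mathbf{v} = \frac{\hat{\mathbf{h}}^{*}}{\| \hat{\mathbf{h}} \|_2}$ and $\tilde{\mathcal{H}}^{\mathrm{UE}} = \emptyset$. If $\mathbb{E} \{ I_{\mathcal{H}}^{\mathrm{UE}} \} \leq O(N^n)$ for some $n < 1$, the lower capacity bound in (34) can be expressed as

$$C^{\mathrm{DL}} \geq \frac{T_{\mathrm{data}}^{\mathrm{DL}}}{T_{\mathrm{coher}}} \log_2 \left( 1 + \frac{| \mathbb{E} \{ \varphi \} |^2 + O \left( \frac{1}{\sqrt{N}} \right)}{(1 + \kappa_r^{\mathrm{UE}}) \, \mathbb{E} \{ | \varphi |^2 \} - | \mathbb{E} \{ \varphi \} |^2 + O \left( \frac{1}{\sqrt{N}} + \frac{1}{N^{1-n}} \right)} \right) \tag{42}$$

where $\varphi$ is given in (41). The terms $O \left( \frac{1}{\sqrt{N}} \right)$ and $O \left( \frac{1}{N^{1-n}} \right)$ vanish when $N \to \infty$, while the other terms are strictly positive in the limit.

Proof: The expression (42) is obtained from (34) by plugging in the expressions in Theorem 4 and multiplying each term by $\frac{1}{\mathrm{tr}(\mathbf{R} - \mathbf{C})} = \frac{1}{p^{\mathrm{UE}} \mathrm{tr}(\mathbf{R} \bar{\mathbf{Z}}^{-1} \mathbf{R})} = O(N^{-1})$. The interference term becomes $\frac{\mathbb{E} \{ I_{\mathcal{H}}^{\mathrm{UE}} \}}{p^{\mathrm{BS}} \mathrm{tr}(\mathbf{R} - \mathbf{C})} = O \left( \frac{1}{N^{1-n}} \right)$.

Combining the upper bound in Corollary 2 with the lower bound in Corollary 4, we have a clear characterization of the DL capacity behavior when $N \to \infty$. Both bounds are independent of $\kappa_t^{\mathrm{BS}}$ in the limit, thus the transmitter hardware of the BS plays little role in massive MIMO systems. Contrary to the upper bound, the level of receiver hardware impairments at the BS ($\kappa_r^{\mathrm{BS}}$) is present in the lower bound (42), through $\mathbf{A}$ and $\boldsymbol{\Psi}$ in $\varphi$. However, the numerical results in Section IV-C reveal that the asymptotic impact of BS impairments is negligible also in the lower bound. This can also be seen analytically in certain cases; if $\kappa_t^{\mathrm{UE}} = 0$ we get $\varphi = 1$ and therefore

$$\lim_{N \to \infty} C_{\mathrm{lower}}^{\mathrm{DL}} = \frac{T_{\mathrm{data}}^{\mathrm{DL}}}{T_{\mathrm{coher}}} \log_2 \left( 1 + \frac{1}{\kappa_r^{\mathrm{UE}}} \right). \tag{43}$$

In this special case, the lower bound actually approaches the upper bound in (31) asymptotically, and any DL capacity can be achieved by making $\kappa_r^{\mathrm{UE}}$ sufficiently small. The opposite is not true; setting $\kappa_r^{\mathrm{BS}} = 0$ will not make the impact of UE impairments vanish. We therefore conclude that the DL capacity limit is mainly determined by the level of impairments at the UE, both in the uplink estimation ($\kappa_t^{\mathrm{UE}}$) and the downlink transmission ($\kappa_r^{\mathrm{UE}}$), although the former connection was not visible in the upper bound since it was based on perfect CSI.

Footnote: We stress that the assumption in Theorem 4 that decoding is performed without instantaneous CSI is only made to enable closed-form lower bounds. The BS should certainly exploit the channel estimate $\hat{\mathbf{h}}$ and the UE might receive a downlink pilot signal that enables estimation of the effective channel $\mathbf{h}^H \mathbf{v}^{\mathrm{DL}}$. While this is relatively easy to handle with ideal hardware, where the channel estimate and estimation error are independent (cf. [12]), the extension to non-ideal hardware seems intractable due to the statistical dependence between the channel estimate and estimation error.

$$\mathrm{SINR}_{\mathrm{lower}}^{\mathrm{DL}}(\mathbf{v}^{\mathrm{DL}}) = \frac{\left| \mathbb{E} \{ \mathbf{h}^H \mathbf{v}^{\mathrm{DL}} \,|\, \tilde{\mathcal{H}}^{\mathrm{UE}} \} \right|^2}{(1 + \kappa_r^{\mathrm{UE}}) \, \mathbb{E} \{ | \mathbf{h}^H \mathbf{v}^{\mathrm{DL}} |^2 \,|\, \tilde{\mathcal{H}}^{\mathrm{UE}} \} - \left| \mathbb{E} \{ \mathbf{h}^H \mathbf{v}^{\mathrm{DL}} \,|\, \tilde{\mathcal{H}}^{\mathrm{UE}} \} \right|^2 + \kappa_t^{\mathrm{BS}} \sum_{i=1}^{N} \mathbb{E} \{ |h_i|^2 |v_i^{\mathrm{DL}}|^2 \,|\, \tilde{\mathcal{H}}^{\mathrm{UE}} \} + \frac{\mathbb{E} \{ I_{\mathcal{H}}^{\mathrm{UE}} \,|\, \tilde{\mathcal{H}}^{\mathrm{UE}} \}}{p^{\mathrm{BS}}} + \frac{\sigma^2}{p^{\mathrm{BS}}}} \tag{36}$$

$$\mathrm{SINR}_{\mathrm{lower}}^{\mathrm{UL}}(\mathbf{v}^{\mathrm{UL}}) = \frac{\left| \mathbb{E} \{ \mathbf{h}^H \mathbf{v}^{\mathrm{UL}} \,|\, \tilde{\mathcal{H}}^{\mathrm{BS}} \} \right|^2}{(1 + \kappa_t^{\mathrm{UE}}) \, \mathbb{E} \{ | \mathbf{h}^H \mathbf{v}^{\mathrm{UL}} |^2 \,|\, \tilde{\mathcal{H}}^{\mathrm{BS}} \} - \left| \mathbb{E} \{ \mathbf{h}^H \mathbf{v}^{\mathrm{UL}} \,|\, \tilde{\mathcal{H}}^{\mathrm{BS}} \} \right|^2 + \kappa_r^{\mathrm{BS}} \sum_{i=1}^{N} \mathbb{E} \{ |h_i|^2 |v_i^{\mathrm{UL}}|^2 \,|\, \tilde{\mathcal{H}}^{\mathrm{BS}} \} + \frac{\mathbb{E} \{ (\mathbf{v}^{\mathrm{UL}})^H (\mathbf{Q}_{\mathcal{H}} + \sigma^2 \mathbf{I}) \mathbf{v}^{\mathrm{UL}} \,|\, \tilde{\mathcal{H}}^{\mathrm{BS}} \}}{p^{\mathrm{UE}}}} \tag{37}$$

For the uplink, we have the following similar asymptotic capacity characterization.

Corollary 5.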

Consider the UL with receive combining vector $\mathbf{v} = \frac{\hat{\mathbf{h}}}{\| \hat{\mathbf{h}} \|_2}$ and $\tilde{\mathcal{H}}^{\mathrm{BS}} = \emptyset$. If $\mathbb{E} \{ \| \mathbf{Q}_{\mathcal{H}} \|_2 \} \leq O(N^n)$ for some $n < 1$, the lower capacity bound in (35) can be expressed as

$$C^{\mathrm{UL}} \geq \frac{T_{\mathrm{data}}^{\mathrm{UL}}}{T_{\mathrm{coher}}} \log_2 \left( 1 + \frac{| \mathbb{E} \{ \varphi \} |^2 + O \left( \frac{1}{\sqrt{N}} \right)}{(1 + \kappa_t^{\mathrm{UE}}) \, \mathbb{E} \{ | \varphi |^2 \} - | \mathbb{E} \{ \varphi \} |^2 + O \left( \frac{1}{\sqrt{N}} + \frac{1}{N^{1-n}} \right)} \right) \tag{44}$$

where $\varphi$ is given in (41). The terms $O \left( \frac{1}{\sqrt{N}} \right)$ and $O \left( \frac{1}{N^{1-n}} \right)$ vanish when $N \to \infty$, while the other terms are strictly positive in the limit.

Proof: The expression (44) is obtained from (35) by plugging in the expressions from Theorem 4 and multiplying each term by $\frac{1}{\mathrm{tr}(\mathbf{R} - \mathbf{C})} = \frac{1}{p^{\mathrm{UE}} \mathrm{tr}(\mathbf{R} \bar{\mathbf{Z}}^{-1} \mathbf{R})} = O(N^{-1})$. The interference term becomes $\frac{\mathbb{E} \{ \mathbf{v}^H \mathbf{Q}_{\mathcal{H}} \mathbf{v} \}}{p^{\mathrm{UE}} \mathrm{tr}(\mathbf{R} - \mathbf{C})} = O \left( \frac{1}{N^{1-n}} \right)$.

The upper bound in Corollary 3 and the lower bound in Corollary 5 provide a joint characterization of the uplink capacity when $N$ grows large. The UE impairments manifest the behavior in both bounds; the BS impairments are present in (42) since $\varphi$ depends on $\mathbf{A}$ and $\boldsymbol{\Psi}$, but their impact vanishes when $\kappa_t^{\mathrm{UE}} \to 0$. By making $\kappa_t^{\mathrm{UE}}$ appropriately small, we can thus achieve any UL capacity as $N$ grows large. We therefore conclude that it is of main importance to have high-quality hardware at the UE, which is analogous to our observations for the DL. These observations are illustrated numerically in the next subsection and are explained by the following remark.

Remark 3 (BS Impairments Vanish Asymptotically). The lower and upper bounds show that it is the quality of the UE's transceiver hardware that limits the DL and UL capacities as $N \to \infty$. Thus, the detrimental effect of hardware impairments at the BS vanishes completely, or almost completely, when the number of BS antennas grows large.
This is, simply speaking, since the BS's distortion noises are spread in arbitrary directions in the $N$-dimensional vector space while the increased spatial resolution of the array enables very exact transmit beamforming and receive combining for the useful signal. This is a very promising result since large arrays are more prone to impairments, due to implementation limitations and the desire to use antenna elements of lower quality (to avoid having deployment costs that increase linearly with $N$). In contrast, the UE's distortion noises are non-vanishing since they behave as interferers with the same effective channels as the useful signals.

Corollaries 4 and 5 assumed that the inter-user interference satisfies $\mathbb{E} \{ I_{\mathcal{H}}^{\mathrm{UE}} \} \leq O(N^n)$ and $\mathbb{E} \{ \| \mathbf{Q}_{\mathcal{H}} \|_2 \} \leq O(N^n)$, respectively, for some $n < 1$. These conditions imply that the interference terms only vanish asymptotically if the scaling with $N$ is slower than linear. This is satisfied by regular interference which has constant variance (i.e., $n = 0$), but there is a special type of non-regular pilot contaminated interference in multi-cell systems that scales linearly with $N$. This adds an additional non-vanishing term to the denominators of (42) and (44). We detail this scenario in Section VI.

Finally, we stress that the DL and UL capacity bounds in Corollaries 2 and 3, respectively, have a very similar structure. The main difference is that the UL is only affected by UL hardware impairments (i.e., $\kappa_t^{\mathrm{UE}}$, $\kappa_r^{\mathrm{BS}}$), while the DL is affected by both DL and UL hardware impairments (i.e., all $\kappa$-parameters) due to the reverse-link channel estimation.

C. Numerical Illustrations
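As a small numerical companion to Corollaries 2 and 3, the asymptotic limits can be tabulated directly from their closed forms. The sketch below is only an illustration: the function names are our own, the impairment levels in the example loop are hypothetical, and `data_fraction` stands for the fraction of channel uses allocated to data, set to 1 by default so that the output is the bare logarithm.

```python
import math


def high_power_limit(n_antennas, kappa_bs, kappa_ue, data_fraction=1.0):
    """Limit (30)/(32): transmit power -> infinity for a fixed number of antennas."""
    ratio = n_antennas / (kappa_bs + kappa_ue * n_antennas)
    return data_fraction * math.log2(1.0 + ratio)


def large_array_limit(kappa_ue, data_fraction=1.0):
    """Limit (31)/(33): number of BS antennas -> infinity."""
    return data_fraction * math.log2(1.0 + 1.0 / kappa_ue)


if __name__ == "__main__":
    kappa_ue = 0.01  # hypothetical UE impairment level
    for kappa_bs in (0.0025, 0.01, 0.04):  # hypothetical BS impairment levels
        row = [round(high_power_limit(n, kappa_bs, kappa_ue), 3) for n in (1, 10, 100, 1000)]
        print(kappa_bs, row, round(large_array_limit(kappa_ue), 3))
```

The BS impairment level shifts the high-power limit for small $N$, but every row approaches the same large-array limit, which depends on the UE impairment level only.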

Next, we illustrate the lower and upper bounds on the capacity that were derived earlier in this section. We consider a scenario without interference, $\mathbf{Q}_{\mathcal{H}} = \mathbf{S} = \mathbf{0}$ and $I_{\mathcal{H}}^{\mathrm{UE}} = 0$, and define the average SNRs as $p^{\mathrm{UE}} \frac{\mathrm{tr}(\mathbf{R})}{N \sigma^2}$ and $p^{\mathrm{BS}} \frac{\mathrm{tr}(\mathbf{R})}{N \sigma^2}$ in the UL and DL, respectively. We consider different fixed SNR values, while we vary the number of antennas $N$ and the levels of hardware impairments. We assume that the transmitter and receiver hardware of each device are of the same quality: $\kappa^{\mathrm{BS}} \triangleq \kappa_t^{\mathrm{BS}} = \kappa_r^{\mathrm{BS}}$ at the BS and $\kappa^{\mathrm{UE}} \triangleq \kappa_t^{\mathrm{UE}} = \kappa_r^{\mathrm{UE}}$ at the UE. Furthermore, we assume T_data^DL / T_coher = T_data^UL / T_coher = 0.[…], which are the percentage of DL data and UL data. These assumptions make the bounds for the DL and UL capacities become identical, thus we can simulate the DL and UL simultaneously.

Fig. 6: Lower and upper bounds on the capacity. Hardware impairments have a fundamental impact on the asymptotic behavior as $N$ grows large. (Panels: (a) SNR: 20 dB; (b) SNR: 0 dB. Axes: number of base station antennas ($N$) versus spectral efficiency [bit/channel use]. Curves: capacity upper bounds, capacity lower bounds, and asymptotic limits (upper and lower), for each level of impairments $\kappa^{\mathrm{BS}} = \kappa^{\mathrm{UE}}$.)

Fig. 6 considers a spatially uncorrelated scenario with $\mathbf{R} = \mathbf{I}$ for different levels of impairments: κ_t^UE = κ_r^BS ∈ {0, 0.[…], 0.[…]}. The meaning of these parameter values was discussed in Remark 1. Simulation results are given for SNRs of 20 dB and 0 dB. The capacity with ideal hardware grows without bound as $N \to \infty$, while the lower and upper bounds converge to finite limits under transceiver hardware impairments. The main difference between the two SNR values is the convergence speed, while the upper bounds are exactly the same and the lower bounds are approximately the same.
Recall that these bounds hold under any CSI $\tilde{\mathcal{H}}^{\mathrm{BS}}$ at the BS and $\tilde{\mathcal{H}}^{\mathrm{UE}}$ at the UE; the lower bounds represent no instantaneous CSI in the decoding step and the upper bounds represent perfect CSI. Although the gap between these extremes is large for ideal hardware, the difference is remarkably small under non-ideal hardware due to the finite capacity limit (caused by distortion noise) and the channel hardening that makes stochastic inner products such as $\mathbf{h}^H \mathbf{v}$ become increasingly deterministic as $N$ grows large. Since a main difference between the lower and upper bounds is the quality of the CSI, the small difference shows that the estimation errors have only a minor impact on the capacity; hence, the estimation error floors described in Section III have no dominating impact in the large-$N$ regime.

Footnote: The transmitter and receiver hardware both involve converters, mixers, filters, and oscillators; see [30, Fig. 1] for a typical transceiver model. The main difference is the type of amplifiers, thus the assumption of identical levels of impairments makes sense when the non-linearities of the amplifiers at the transmitter are not the dominating source of distortion noise.

The asymptotic capacity limits in Fig. 6 are characterized by the level of impairments, thus the hardware quality has a fundamental impact on the achievable spectral efficiency. If the SNRs are sufficiently high (e.g., 20 dB), the majority of the multi-antenna gain is achieved at relatively low $N$; in particular, only minor improvements can be achieved by having more than $N = 100$ antennas. Larger numbers are, however, useful for inter-user interference suppression and multiplexing; see Section VI. We need many more antennas to achieve convergence at 0 dB SNR than at 20 dB, because a 100 times larger array gain is required to compensate for the lower SNR. Hence, we conclude that the massive MIMO gains are much more attractive at higher SNRs (which matches well with the results in Section III, where 20–30 dB SNR was needed to achieve a close-to-perfect channel estimate). Therefore, we only consider an SNR of 20 dB in the remainder of this section.

[Fig. 7: Lower and upper bounds on the capacity for $\kappa^{\mathrm{UE}} = 0.$ . The impact of hardware impairments at the BS vanishes asymptotically. Horizontal axis: $N$; vertical axis: spectral efficiency [bit/channel use]. Curves: capacity upper bounds, capacity lower bounds, and asymptotic limits (upper and lower), for three values of $\kappa^{\mathrm{BS}}$; the curves are decreasing with increasing BS impairments.]

Fig. 7 considers the same scenario as in Fig. 6 but with a fixed level of impairments $\kappa^{\mathrm{UE}} = 0.$ at the UE and different values at the BS.
As expected from the analysis, the lower and upper capacity bounds decrease with $\kappa^{\mathrm{BS}}$, but the difference is only visible at small $N$ since the curves converge to virtually the same value as $N \to \infty$. This validates that the impact of impairments at the BS vanishes as $N$ grows large.

Finally, we consider the capacity behavior for different channel covariance models, namely the four propagation scenarios described in Section III-B. The lower and upper capacity bounds are shown in Fig. 8 for $\kappa^{\mathrm{BS}} = \kappa^{\mathrm{UE}} = 0.$ . The upper bound is identical for all the models, since it only utilizes the diagonal elements of $\mathbf{R}$. However, there are clear differences between the lower bounds. The spatially uncorrelated covariance model provides the highest performance, while the strongly spatially correlated one-ring model from [42] with 10 degrees angular spread gives the lowest performance. This stands in contrast to Section III-B, where the highly correlated channels gave the lowest estimation errors (i.e., highest estimation accuracy). However, the differences between the channel covariance models vanish asymptotically as $N \to \infty$.

[Fig. 8: Lower and upper bounds on the capacity as a function of the number of BS antennas. Four different channel covariance models are considered and the hardware impairments are characterized by $\kappa^{\mathrm{BS}} = \kappa^{\mathrm{UE}} = 0.$ . Horizontal axis: $N$; vertical axis: spectral efficiency [bit/channel use]. Curves: upper bound; Case 1: uncorrelated; Case 2: exponential model; Case 3: one-ring, 20 degrees; Case 4: one-ring, 10 degrees; asymptotic limits.]

V. Improving Energy Efficiency and Reducing Hardware Quality

Next, we analyze how the energy efficiency (EE) can be optimized in massive MIMO systems. The EE is measured in bit/Joule and a common EE definition is the ratio of the spectral efficiency (in bit/channel use) to the emitted power (in Joule/channel use). It has recently been shown that the array gain in massive MIMO systems can be utilized to reduce the emitted power; see [12], [13] for systems with ideal hardware and [25] for systems with phase noise from free-running oscillators. More specifically, these prior works show that one can reduce the transmit powers as $1/N^t$, for $0 < t < 1/2$, and still achieve non-zero spectral efficiencies as $N \to \infty$. By following this power scaling law, we can achieve an infinitely high EE as $N \to \infty$ because the numerator has a non-zero limit and the denominator goes to zero as $1/N^t$ [1]. Although this property indicates that massive MIMO systems can be very energy efficient, the unboundedness also shows that the conventional EE metric needs to be revised when applied to massive MIMO systems. In this section, we consider a refined metric of overall EE (based on prior work in [50]–[54]) and use it to analyze the overall EE of massive MIMO systems.

Under the TDD protocol, the energy consumed in the amplifiers of the transmitters (per coherence period) is
\[
E_{\mathrm{amp}} = \left(T_{\mathrm{pilot}}^{\mathrm{DL}} + T_{\mathrm{data}}^{\mathrm{DL}}\right)\frac{p^{\mathrm{BS}}}{\omega^{\mathrm{BS}}} + \left(T_{\mathrm{pilot}}^{\mathrm{UL}} + T_{\mathrm{data}}^{\mathrm{UL}}\right)\frac{p^{\mathrm{UE}}}{\omega^{\mathrm{UE}}} \quad [\text{Joule}] \tag{45}
\]
where the parameters $\omega^{\mathrm{BS}}, \omega^{\mathrm{UE}} \in [0, 1]$ are the efficiencies of the power amplifiers at the BS and UE, respectively.
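The average power (in Joule/channel use) can then be separated as
\[
\frac{E_{\mathrm{amp}}}{T_{\mathrm{coher}}} = \underbrace{\alpha^{\mathrm{DL}}\left(\frac{T_{\mathrm{pilot}}^{\mathrm{DL}}}{T_{\mathrm{coher}}}\frac{p^{\mathrm{BS}}}{\omega^{\mathrm{BS}}} + \frac{T_{\mathrm{pilot}}^{\mathrm{UL}}}{T_{\mathrm{coher}}}\frac{p^{\mathrm{UE}}}{\omega^{\mathrm{UE}}}\right) + \frac{T_{\mathrm{data}}^{\mathrm{DL}}}{T_{\mathrm{coher}}}\frac{p^{\mathrm{BS}}}{\omega^{\mathrm{BS}}}}_{\text{Downlink power}} + \underbrace{\alpha^{\mathrm{UL}}\left(\frac{T_{\mathrm{pilot}}^{\mathrm{DL}}}{T_{\mathrm{coher}}}\frac{p^{\mathrm{BS}}}{\omega^{\mathrm{BS}}} + \frac{T_{\mathrm{pilot}}^{\mathrm{UL}}}{T_{\mathrm{coher}}}\frac{p^{\mathrm{UE}}}{\omega^{\mathrm{UE}}}\right) + \frac{T_{\mathrm{data}}^{\mathrm{UL}}}{T_{\mathrm{coher}}}\frac{p^{\mathrm{UE}}}{\omega^{\mathrm{UE}}}}_{\text{Uplink power}} \tag{46}
\]
where the ratios of DL and UL transmission are, respectively,
\[
\alpha^{\mathrm{DL}} = \frac{T_{\mathrm{data}}^{\mathrm{DL}}}{T_{\mathrm{data}}^{\mathrm{DL}} + T_{\mathrm{data}}^{\mathrm{UL}}} \tag{47}
\]
\[
\alpha^{\mathrm{UL}} = \frac{T_{\mathrm{data}}^{\mathrm{UL}}}{T_{\mathrm{data}}^{\mathrm{DL}} + T_{\mathrm{data}}^{\mathrm{UL}}}. \tag{48}
\]
Since $\alpha^{\mathrm{DL}} + \alpha^{\mathrm{UL}} = 1$, the two parts in (46) add up to the total in (45) divided by $T_{\mathrm{coher}}$.

In addition to the power consumed by the amplifiers, there is generally a baseband circuit power consumption which we model as $N\rho + \zeta$ [50]–[54]. The parameter $\rho \geq 0$ [Joule/channel use] describes the circuit power that scales with the number of antennas; for example, hardware components that are needed at each antenna branch (e.g., converters, mixers, and filters) and computational complexity that is proportional to $N$ (e.g., channel estimation and computing MRT/MRC). In contrast, the parameter $\zeta > 0$ [Joule/channel use] is a static circuit power term that is independent of $N$ (but might scale with the number of UEs); for example, it models baseband processing at the BS and circuit power at the UE. Based on the power consumption model described above, and inspired by the seminal work in [55], we define the overall energy efficiency (in bit/Joule) as follows.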

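Definition 1. The downlink energy efficiency is
\[
\mathrm{EE}^{\mathrm{DL}} = \frac{C^{\mathrm{DL}}}{\alpha^{\mathrm{DL}}\left(\frac{T_{\mathrm{pilot}}^{\mathrm{DL}}}{T_{\mathrm{coher}}}\frac{p^{\mathrm{BS}}}{\omega^{\mathrm{BS}}} + \frac{T_{\mathrm{pilot}}^{\mathrm{UL}}}{T_{\mathrm{coher}}}\frac{p^{\mathrm{UE}}}{\omega^{\mathrm{UE}}} + N\rho + \zeta\right) + \frac{T_{\mathrm{data}}^{\mathrm{DL}}}{T_{\mathrm{coher}}}\frac{p^{\mathrm{BS}}}{\omega^{\mathrm{BS}}}} \tag{49}
\]
and the uplink energy efficiency is
\[
\mathrm{EE}^{\mathrm{UL}} = \frac{C^{\mathrm{UL}}}{\alpha^{\mathrm{UL}}\left(\frac{T_{\mathrm{pilot}}^{\mathrm{DL}}}{T_{\mathrm{coher}}}\frac{p^{\mathrm{BS}}}{\omega^{\mathrm{BS}}} + \frac{T_{\mathrm{pilot}}^{\mathrm{UL}}}{T_{\mathrm{coher}}}\frac{p^{\mathrm{UE}}}{\omega^{\mathrm{UE}}} + N\rho + \zeta\right) + \frac{T_{\mathrm{data}}^{\mathrm{UL}}}{T_{\mathrm{coher}}}\frac{p^{\mathrm{UE}}}{\omega^{\mathrm{UE}}}}. \tag{50}
\]
The EE of any suboptimal transmission scheme is obtained by replacing the capacities $C^{\mathrm{DL}}$ and $C^{\mathrm{UL}}$ with the corresponding achievable spectral efficiencies.

This definition considers a single link, which can be any of the links in a massive MIMO system; the parameters $\zeta$ and $\rho$ should then be interpreted as the energy per channel use per user. In the process of maximizing the EE metric in Definition 1, we first extend the power scaling laws from [12], [13], [25] to our general system model with non-ideal hardware.

Footnote on the amplifier efficiencies $\omega^{\mathrm{BS}}, \omega^{\mathrm{UE}}$: The efficiency of a specific amplifier depends on the transmit power, but to facilitate analysis we assume that the amplifier is optimized jointly with the transmit power to give a specific efficiency at the particular power level. The efficiency also depends on the PAPR and acceptable distortion noise, which are two properties that we also keep fixed when optimizing the EE.

Footnote on the static circuit power term $\zeta$: This term can also model the overhead power consumption of the network as a whole, which enables comparison of network architectures with different BS density, amounts of backhaul signaling, etc.

Theorem 5.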

Suppose the downlink transmit power $p^{\mathrm{BS}}$ and uplink pilot power $p^{\mathrm{UE}}$ are reduced with $N$ proportionally to $1/N^{t_{\mathrm{BS}}}$ and $1/N^{t_{\mathrm{UE}}}$, respectively. If $t_{\mathrm{BS}} + t_{\mathrm{UE}} <$ , $t_{\mathrm{BS}} \geq$ , $< t_{\mathrm{UE}} <$ , and $\mathbb{E}\{I_{\mathcal{H}}^{\mathrm{UE}}\} = \mathcal{O}(1)$, we have
\[
\lim_{N \to \infty} C^{\mathrm{DL}} \geq \frac{T_{\mathrm{data}}^{\mathrm{DL}}}{T_{\mathrm{coher}}} \log_2\left(1 + \frac{1}{\kappa_r^{\mathrm{UE}} + \kappa_t^{\mathrm{UE}} + \kappa_r^{\mathrm{UE}}\kappa_t^{\mathrm{UE}}}\right). \tag{51}
\]
Similarly, suppose the uplink transmit/pilot power $p^{\mathrm{UE}}$ is reduced with $N$ proportionally to $1/N^{t_{\mathrm{UE}}}$. If $0 < t_{\mathrm{UE}} < 1/2$ and $\mathbb{E}\{\|\mathbf{Q}_{\mathcal{H}}\|\} = \mathcal{O}(1)$, we have
\[
\lim_{N \to \infty} C^{\mathrm{UL}} \geq \frac{T_{\mathrm{data}}^{\mathrm{UL}}}{T_{\mathrm{coher}}} \log_2\left(1 + \frac{1}{\kappa_t^{\mathrm{UE}} + (\kappa_t^{\mathrm{UE}})^2}\right). \tag{52}
\]

Proof: The proof is given in Appendix C-D.

Theorem 5 shows that one can reduce the downlink and uplink transmit powers as $N$ grows large (e.g., roughly proportionally to $1/\sqrt{N}$) and converge to non-zero spectral efficiencies. The asymptotic DL capacity is lower bounded by (51) and the UL capacity by (52). As expected from Section IV, these lower bounds only depend on the levels of impairments at the UE. The conditions $\mathbb{E}\{I_{\mathcal{H}}^{\mathrm{UE}}\} = \mathcal{O}(1)$ and $\mathbb{E}\{\|\mathbf{Q}_{\mathcal{H}}\|\} = \mathcal{O}(1)$ in Theorem 5 are stronger than the ones in Corollaries 4 and 5, thus the interfering transmissions might have to reduce their transmit powers as well if their impact should vanish asymptotically. We note that the lower bounds in Theorem 5 are achieved by using the LMMSE estimator in Theorem 1 for channel estimation and simple linear processing at the BS (approximate MRT in the DL and MRC in the UL).

Based on Theorem 5 and the upper capacity bounds in Section IV, the following corollary describes how to maximize the EE.

Corollary 6.

Suppose we want to maximize the EE metrics with respect to the transmit powers and the number of antennas. Let $\mathbb{E}\{I_{\mathcal{H}}^{\mathrm{UE}}\} = \mathcal{O}(1)$ and $\mathbb{E}\{\|\mathbf{Q}_{\mathcal{H}}\|\} = \mathcal{O}(1)$. If $\rho = 0$, the maximal EEs are bounded as
\[
\frac{\log_2\left(1 + \frac{1}{\kappa_r^{\mathrm{UE}} + \kappa_t^{\mathrm{UE}} + \kappa_r^{\mathrm{UE}}\kappa_t^{\mathrm{UE}}}\right)}{\frac{T_{\mathrm{coher}}}{T_{\mathrm{data}}^{\mathrm{DL}}}\alpha^{\mathrm{DL}}\zeta} \leq \max_{p^{\mathrm{BS}}, p^{\mathrm{UE}} \geq 0,\; N \geq 1} \mathrm{EE}^{\mathrm{DL}} \leq \frac{\log_2\left(1 + \frac{1}{\kappa_r^{\mathrm{UE}}}\right)}{\frac{T_{\mathrm{coher}}}{T_{\mathrm{data}}^{\mathrm{DL}}}\alpha^{\mathrm{DL}}\zeta} \tag{53}
\]
\[
\frac{\log_2\left(1 + \frac{1}{\kappa_t^{\mathrm{UE}} + (\kappa_t^{\mathrm{UE}})^2}\right)}{\frac{T_{\mathrm{coher}}}{T_{\mathrm{data}}^{\mathrm{UL}}}\alpha^{\mathrm{UL}}\zeta} \leq \max_{p^{\mathrm{UE}} \geq 0,\; N \geq 1} \mathrm{EE}^{\mathrm{UL}} \leq \frac{\log_2\left(1 + \frac{1}{\kappa_t^{\mathrm{UE}}}\right)}{\frac{T_{\mathrm{coher}}}{T_{\mathrm{data}}^{\mathrm{UL}}}\alpha^{\mathrm{UL}}\zeta} \tag{54}
\]
where the lower bounds are achieved as $N \to \infty$ using the power scaling law in Theorem 5.

If $\rho > 0$, the upper bounds in (53) and (54) are still valid, but the asymptotic EEs are
\[
\lim_{N \to \infty} \max_{p^{\mathrm{BS}}, p^{\mathrm{UE}} \geq 0} \mathrm{EE}^{\mathrm{DL}} = \lim_{N \to \infty} \max_{p^{\mathrm{UE}} \geq 0} \mathrm{EE}^{\mathrm{UL}} = 0 \tag{55}
\]
and, consequently, the EEs are maximized at some finite $N$.

Proof: The lower bounds for $\rho = 0$ are achieved as described in the corollary, while the upper bounds follow from neglecting the transmit power term in the denominator and applying the capacity upper bounds from Corollaries 2 and 3. In the case of $\rho > 0$, we note that the EE is non-zero for $N = 1$ for any non-zero transmit power, while the EE goes to zero as $N \to \infty$ since the denominators of the EE metrics grow to infinity and the numerators are bounded.

This corollary reveals that the maximal overall EE is finite, even in massive MIMO systems. If the circuit power consumption does not scale with $N$, such that $\rho = 0$, we can achieve an EE very close to the upper bounds in (53) and (54) by having very many antennas. This changes completely when there is a non-zero circuit power per antenna: $\rho > 0$. The maximal EE is then achieved at some finite $N$, which naturally depends on the parameters $\rho$, $\zeta$, $\omega^{\mathrm{BS}}$, and $\omega^{\mathrm{UE}}$. We illustrate this dependence numerically in the next subsection.

Since $\rho$ has a dominating impact on the maximal EE in massive MIMO systems, one would like to find a way to reduce $\rho$.
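The qualitative behavior in Corollary 6 can be sketched with a few lines of Python. This is only an illustration under stated assumptions, not the simulation behind the figures: the spectral efficiency is a toy saturating model (array gain capped at $\log_2(1 + 1/\kappa)$, mimicking the ceilings in (53)), the denominator follows the structure of (49), and every function name and default value below is hypothetical.

```python
import math


def spectral_efficiency(n_antennas, snr, kappa_ue, data_fraction):
    """Toy model (assumption): array gain N*snr that saturates at 1/kappa_ue."""
    gain = n_antennas * snr
    return data_fraction * math.log2(1.0 + gain / (1.0 + kappa_ue * gain))


def downlink_ee(n_antennas, rho, zeta, p_bs=0.02, p_ue=0.02, snr_at_p=100.0,
                kappa_ue=0.0025, alpha_dl=0.5, pilot_dl=0.05, pilot_ul=0.05,
                data_dl=0.45, omega_bs=0.3, omega_ue=0.3):
    """Downlink EE with the structure of (49); all defaults are illustrative."""
    se = spectral_efficiency(n_antennas, snr_at_p, kappa_ue, data_dl)
    # Pilot and circuit terms that are shared between DL and UL via alpha_dl.
    shared = (pilot_dl * p_bs / omega_bs + pilot_ul * p_ue / omega_ue
              + n_antennas * rho + zeta)
    return se / (alpha_dl * shared + data_dl * p_bs / omega_bs)


def best_num_antennas(rho, zeta, grid):
    """Grid search for the N that maximizes the downlink EE at fixed power."""
    return max(grid, key=lambda n: downlink_ee(n, rho, zeta))


if __name__ == "__main__":
    grid = range(1, 1001)
    print(best_num_antennas(rho=0.0, zeta=2.0, grid=grid))    # largest N in the grid
    print(best_num_antennas(rho=0.02, zeta=1.98, grid=grid))  # a finite, interior N
```

With $\rho = 0$ the denominator does not depend on $N$, so the EE of this toy model keeps increasing with $N$ and stays below the ceiling of the form in (53); with $\rho > 0$ the linearly growing circuit power eventually dominates and the maximum moves to a finite $N$.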
Generally speaking, the hardware power consumption depends on the circuit architecture and the hardware resolution [7], [21]; by tolerating larger hardware impairments we can also reduce the power dissipation in the corresponding circuits. Now recall from Section IV that the impact of hardware impairments at the BS vanishes as $N \to \infty$. This fact raises the important question: Can we increase the levels of impairments at the BS as $N$ grows and still obtain non-zero capacities? The answer is given by the following corollary.

Corollary 7. Suppose the levels of impairments $\kappa_t^{\mathrm{BS}}$, $\kappa_r^{\mathrm{BS}}$ are increased with $N$ proportionally to $N^{\tau_t}$ and $N^{\tau_r}$, respectively. The lower capacity bounds in Corollaries 4 and 5 (for $n \leq$ ) converge to non-zero quantities as $N \to \infty$ if $\tau_r <$ in the UL and $\tau_t + \tau_r <$ and $\tau_r <$ in the DL.

Proof: The proof is given in Appendix C-E.

This corollary shows that we can indeed increase the levels of impairments, $\kappa_t^{\mathrm{BS}}$ and $\kappa_r^{\mathrm{BS}}$, at the BS roughly proportionally to $\sqrt{N}$ and still have a non-zero asymptotic capacity. The numerical results in the next subsection show that only minor degradations of the lower capacity bounds appear when the impairment scaling law in Corollary 7 is followed.

Recall from (5) that the conventional EVM measure of transceiver quality equals the square root of the $\kappa$-parameters, thus Corollary 7 shows that the EVMs can be increased proportionally to $N^{1/4}$. A high-quality BS antenna element with an EVM of . can thus be replaced by 256 low-quality antenna elements with an EVM of . , while the loss in capacity is negligible. This is a very encouraging result, since it indicates that massive MIMO can be deployed with BS hardware components that are inexpensive and have lower quality, and thus lower power consumption, than conventional ones (i.e., $\rho$ is smaller). If the hardware components are treated as optimization variables, the maximal EE is achieved by jointly reducing the transmit power and the circuit power consumption with $N$. This optimization is, however, strongly dependent on the practical hardware setup (e.g., how an increased EVM maps to a smaller circuit power dissipation) and is outside the scope of this paper. Finally, we note that the ability to degrade the hardware quality with $N$ comes in addition to all other benefits of massive MIMO, such as the array gain and the decorrelation of user channels; see the multi-cell results in Section VI.

A. Numerical Illustrations
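As a first small illustration, the EVM arithmetic behind the impairment scaling law can be checked directly: if $\kappa$ grows as $N^{\tau}$ and the EVM is $\sqrt{\kappa}$, the EVM grows as $N^{\tau/2}$. The sketch below assumes the boundary exponent $\tau = 1/2$; the helper names and the single-antenna EVM of 0.05 used in the example are hypothetical values chosen for illustration.

```python
import math


def evm_growth_factor(n_antennas, tau=0.5):
    """Factor by which the EVM may grow when kappa scales as N**tau."""
    return n_antennas ** (tau / 2.0)


def scaled_evm(evm_single_antenna, n_antennas, tau=0.5):
    """EVM per antenna after scaling kappa = EVM**2 proportionally to N**tau."""
    kappa = evm_single_antenna ** 2 * n_antennas ** tau
    return math.sqrt(kappa)


if __name__ == "__main__":
    # Hypothetical starting point: EVM of 0.05 for a single high-quality element.
    print(evm_growth_factor(256))   # growth factor for 256 antennas
    print(scaled_evm(0.05, 256))    # tolerated EVM per element with 256 antennas
```

For 256 antennas the growth factor is $256^{1/4} = 4$, so each element may have a four times larger EVM than the single high-quality element it replaces.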

Next, we illustrate how the overall EE depends on the number of antennas, transmit powers, and circuit power parameters $\zeta$ and $\rho$. Based on the power consumption numbers reported in [56, Table 7], we consider two setups: $\zeta + \rho = 2\ \mu\mathrm{J}$/channel use and $\zeta + \rho = 0.02\ \mu\mathrm{J}$/channel use. These represent the total circuit power consumptions in a system with $N = 1$ antenna. Since the total circuit power for arbitrary $N$ is $\zeta + N\rho$, we consider three different splittings between $\rho$ and $\zeta$: $\frac{\rho}{\zeta + \rho} \in \{0, 0. , 0. \}$. From an EE optimization perspective, any scaling of the power amplifier efficiencies is equivalent to an inverse scaling of $\zeta$ and $\rho$. Hence, we can set the efficiencies to $\omega^{\mathrm{BS}} = \omega^{\mathrm{UE}} = 0.$ without limiting the generality of our numerical results.

Footnote: As a reference, these numbers correspond to 18 W and 0.18 W, respectively, for a system with an effective bandwidth of 9 MHz, since then there are $9 \cdot 10^6$ channel uses per second.

The transmit powers in the UL and DL are assumed to be equal and upper bounded by $p_{\max} = 0.0222\ \mu\mathrm{J}$/channel use. We consider a scenario without interference (i.e., $\mathbf{Q}_{\mathcal{H}} = \mathbf{S} = \mathbf{0}$ and $I_{\mathcal{H}}^{\mathrm{UE}} = 0$), and the channel covariance matrix $\mathbf{R}$ is generated by the exponential correlation model in (17) with correlation coefficient $r = 0.$ . We let $N\sigma^2/\mathrm{tr}(\mathbf{R}) = p_{\max}/100$ (in $\mu\mathrm{J}$/channel use) in both the UL and DL, which gives an SNR of 20 dB if the maximal transmit power $p_{\max}$ is used; recall from Section IV-C that this SNR is desirable if one should operate close to the asymptotic capacity limits. To make the EE of the DL and UL equal, we consider a symmetric scenario with $\alpha^{\mathrm{DL}} = \alpha^{\mathrm{UL}} = 0.5$ and $T_{\mathrm{pilot}}^{\mathrm{DL}}/T_{\mathrm{coher}} = T_{\mathrm{pilot}}^{\mathrm{UL}}/T_{\mathrm{coher}} = 0.$ .

Footnote: As a reference, this number corresponds to 200 mW, or 23 dBm, for a system with an effective bandwidth of 9 MHz.

Fig. 9 shows the achievable DL/UL energy efficiencies using the lower capacity bounds in Theorem 3. The levels of impairments are set to $\kappa_t^{\mathrm{BS}} = \kappa_r^{\mathrm{BS}} = \kappa_t^{\mathrm{UE}} = \kappa_r^{\mathrm{UE}} = 0.$ ; see Remark 1 for the interpretation of these parameter values. The transmit powers are either optimized numerically for maximal EE at each $N$, fixed at the value that is optimal for $N = 1$, or reduced from this value according to the power scaling law in Theorem 5 with $t = 1/2$. Fig. 9 shows that the EE is almost the same in all three cases, but varies a lot with the circuit power parameters $\zeta$ and $\rho$. If $\rho = 0$, the EE increases monotonically with $N$ and eventually converges according to (53) and (54) in Corollary 6. On the contrary, the EE has a unique maximum when $\rho > 0$ and then decreases towards zero. The maximum is in the range of $\leq N \leq$ in the figure, but the exact position depends on the circuit power parameters. If $\frac{\rho}{\zeta + \rho}$ is sufficiently small we can use a larger $N$ without losing much in EE. Hence, it is important to make the power that scales with $N$ (e.g., the number of extra hardware components and the computational complexity) as low as possible if massive MIMO systems should excel in terms of energy efficiency.

The EE with ideal hardware is also shown in Fig. 9, which reveals that the difference in EE between ideal and non-ideal hardware is small. This is because the performance loss from hardware impairments is relatively small at a reasonable number of antennas.

[Fig. 9: Achievable energy efficiency with ideal and non-ideal hardware for fixed transmit power ($t = 0$), transmit power that decreases as $1/N^t$ for $t = 1/2$, and the transmit power that maximizes the EE. The EE is computed using the lower bounds in Theorem 3 and is valid for both DL and UL transmissions. Horizontal axis: number of base station antennas ($N$); vertical axis: energy efficiency [bit/Joule]. Curves: Ideal: Optimized; Non-Ideal: Optimized; Non-Ideal: $t = 0$; Non-Ideal: $t = 1/2$; shown for both values of $\zeta + \rho$ and three values of $\frac{\rho}{\zeta + \rho}$.]

In order to compare the three different power allocations, we show the corresponding transmit powers in Fig. 10. Recall that the power is either fixed, reduced with $N$ according to the scaling law in Theorem 5, or optimized for maximal EE at each $N$. Despite the very similar EEs in Fig. 9, these three power allocations behave very differently. If $\rho = 0$, the optimal transmit power decreases with $N$ but at a clearly slower pace than $1/\sqrt{N}$ (which is the fastest power scaling that gives a non-zero asymptotic rate according to Theorem 5). However, the optimal transmit power for $\rho > 0$ only decreases until the maximal EE is achieved (which is in the range of $\leq N \leq$ ) and then increases with $N$. This makes sense, because when the circuit power increases we can also afford to use more transmit power to get a higher spectral efficiency. In the case when the circuit power is large (i.e., $\zeta + \rho = 2\ \mu\mathrm{J}$), we see that it is often optimal to use full transmit power, as represented by the upper straight line. To summarize, the transmit power in massive MIMO systems can be decreased monotonically with $N$, but this is generally not the way to maximize the EE since we have $\rho > 0$ in most practical systems. The loss in EE by decreasing the power appears to be small, but the loss in spectral efficiency is naturally larger due to the definition of the EE.
If we want a simple design rule, it is better to keep the total power fixed for all $N$ than to decrease it with $N$.

[Fig. 10: The transmit powers that correspond to the curves in Fig. 9. Horizontal axis: number of base station antennas ($N$); vertical axis: transmit power [$\mu\mathrm{J}$/channel use]. Curves: Ideal: Optimized; Non-Ideal: Optimized; Non-Ideal: $t = 0$; Non-Ideal: $t = 1/2$; shown for both values of $\zeta + \rho$ and three values of $\frac{\rho}{\zeta + \rho}$.]

Finally, the ability to increase the levels of impairments at the BS with $N$ is illustrated in Fig. 11. We consider the same symmetric scenario as in the previous two figures, but the average SNR is set to 20 dB in the DL and UL. We have $\kappa_t^{\mathrm{UE}} = \kappa_r^{\mathrm{UE}} = 0.$ at the UE, while the levels of impairments at the BS are scaled as $\kappa_t^{\mathrm{BS}} = \kappa_r^{\mathrm{BS}} = 0.\,N^{\tau}$ for different $\tau$-values: $\tau \in \{0, \frac{1}{4}, \frac{1}{2}, 1, 2\}$. The lower capacity bounds are shown in Fig. 11 as a function of $N$. The simulation confirms that the performance degradation is small when the impairment scaling law in Corollary 7 is followed (e.g., for $\tau = \frac{1}{4}$ and $\tau = \frac{1}{2}$). A larger performance loss is observed for $\tau = 1$ and the curve begins to bend downwards at $N \approx$ . In the extreme case of $\tau = 2$, the lower bound goes quickly to zero. While these asymptotic results are based on the lower capacity bounds, we note that whenever $\tau >$ the upper capacity bounds in Corollaries 2 and 3 converge to zero as well.

[Fig. 11: Lower bounds on the capacity when the levels of impairments at the BS are increased with $N$ as $N^{\tau}$ for $\tau \in \{0, \frac{1}{4}, \frac{1}{2}, 1, 2\}$. The results are valid for both DL and UL transmissions. Horizontal axis: $N$; vertical axis: spectral efficiency [bit/channel use]. Curves: fixed hardware ($\tau = 0$) and hardware degradation with $\tau \in \{\frac{1}{4}, \frac{1}{2}\}$, $\tau = 1$, and $\tau = 2$.]

VI. Extensions to Multi-Cell Scenarios

The previous sections focused on a single link of a massive MIMO system, which experiences interference from other concurrent transmissions. Recall that $I_{\mathcal{H}}^{\mathrm{UE}} \in \mathbb{C}$ and $\mathbf{Q}_{\mathcal{H}} \in \mathbb{C}^{N \times N}$ are the conditional covariances of these transmissions in the DL and UL, respectively, for a given set of channel realizations $\mathcal{H}$. The asymptotic capacity analysis in Section IV, particularly the lower capacity bounds in Corollaries 4 and 5, is based on the assumptions that $\mathbb{E}\{I_{\mathcal{H}}^{\mathrm{UE}}\} \leq \mathcal{O}(N^n)$ and $\mathbb{E}\{\|\mathbf{Q}_{\mathcal{H}}\|\} \leq \mathcal{O}(N^n)$ for some $n < 1$. However, the interference variance can actually increase faster than this, particularly under the so-called pilot contamination, which gives rise to terms that scale linearly with $N$ [9]–[13]. This section investigates the impact of inter-user interference on massive MIMO systems with non-ideal hardware. The BS and UE from the previous sections are referred to as the ones under study.

A. Inter-User Interference in the Uplink

To exemplify the impact of inter-user interference, we assume that there is a set $\mathcal{U}$ of co-users that are scheduled for UL transmission in the current coherence period. Each co-user is served by the BS under study or any of the neighboring BSs, thus the total number of co-users $|\mathcal{U}|$ is generally large. The association of UEs to BSs is arbitrary since the association has no impact on the UE under study in the UL. The block-fading channel from UE $l \in \mathcal{U}$ to the BS under study is modeled as $\mathbf{h}_l \sim \mathcal{CN}(\mathbf{0}, \mathbf{R}_l)$, where $\mathbf{R}_l$ has bounded spectral norm and the channel is ergodic and block fading. Recall that $\mathcal{H}$ is the set of channel realizations for all channels in the system, thus we have $\mathbf{h}_l \in \mathcal{H}$ for all $l \in \mathcal{U}$. The co-user channels are assumed to be independent, which in practice means that users are selected to have no common scatterers [3], [4]; this is a basic criterion of spatial user separability in the scheduler. A more refined scheduling criterion would be the one in [48], where the coverage area is divided into location bins. The users in a bin are roughly equivalent in terms of channel statistics and should not be served simultaneously. Users in different bins have independent channels and sufficiently different spatial properties, thus selecting one user per location bin for parallel transmission is a reasonable scheduling decision.

The UL pilot signaling is limited to $T_{\mathrm{pilot}}^{\mathrm{UL}}$ channel uses in the TDD protocol depicted in Fig. 2. Since the number of active co-users generally satisfies $|\mathcal{U}| > T_{\mathrm{pilot}}^{\mathrm{UL}}$, each pilot channel use must be allocated to multiple users. We divide $\mathcal{U}$ into two disjoint sets: $\mathcal{U}_{\parallel}$ are the users that transmit in parallel with the pilot of the UE under study, while $\mathcal{U}_{\perp}$ are the remaining users. The co-users in the same cell as the UE under study are usually in $\mathcal{U}_{\perp}$, but this is not necessary. The interference vector during UL pilot signaling is
\[
\boldsymbol{\nu}_{\mathrm{interf}}^{\mathrm{pilot}} = \sum_{l \in \mathcal{U}_{\parallel}} d_l \mathbf{h}_l \tag{56}
\]
where $d_l$ is the signal transmitted by UE $l \in \mathcal{U}_{\parallel}$. These signals can be either deterministic or stochastic, thus some of the interfering transmissions can in principle carry data instead of pilot signals (cf. Remark 5 in [13]). Assuming $\mathbb{E}\{|d_l|^2\} = p^{\mathrm{UE}}$, the interference covariance matrix during pilot signaling is
\[
\mathbf{S} = \mathbb{E}\left\{\boldsymbol{\nu}_{\mathrm{interf}}^{\mathrm{pilot}} (\boldsymbol{\nu}_{\mathrm{interf}}^{\mathrm{pilot}})^H\right\} = p^{\mathrm{UE}} \sum_{l \in \mathcal{U}_{\parallel}} \mathbf{R}_l. \tag{57}
\]

Footnote: Only one pilot channel use is allocated per UE in this section. For other pilot lengths $B > 1$, one can construct up to $B$ parallel pilot signals that are orthogonal in space. This increases the pilot power per UE, but does not increase the total number of orthogonal pilots.

The LMMSE estimator in Theorem 1 and the corresponding analysis in Section III hold for any covariance matrix $\mathbf{S}$, thus the explicit ways of computing $\boldsymbol{\nu}_{\mathrm{interf}}^{\mathrm{pilot}}$ and $\mathbf{S}$ in (56)–(57) can be plugged in directly. The channel estimate $\hat{\mathbf{h}}$ is now correlated with the co-user channels $\mathbf{h}_l$ for $l \in \mathcal{U}_{\parallel}$, which has an important impact on the spectral efficiency. More specifically, the interference vector during UL data transmission becomes
\[
\boldsymbol{\nu}_{\mathrm{interf}}^{\mathrm{data}} = \sum_{l \in \mathcal{U}} d_l \mathbf{h}_l \tag{58}
\]
where $d_l$ is the independent zero-mean stochastic data signal sent by UE $l \in \mathcal{U}$ and has power $\mathbb{E}\{|d_l|^2\} = p^{\mathrm{UE}}$. The conditional interference covariance matrix during data transmission is
\[
\mathbf{Q}_{\mathcal{H}} = \mathbb{E}\left\{\boldsymbol{\nu}_{\mathrm{interf}}^{\mathrm{data}} (\boldsymbol{\nu}_{\mathrm{interf}}^{\mathrm{data}})^H \,\middle|\, \mathcal{H}\right\} = p^{\mathrm{UE}} \sum_{l \in \mathcal{U}} \mathbf{h}_l \mathbf{h}_l^H. \tag{59}
\]
Note that $\mathbf{Q}_{\mathcal{H}}$ depends on the channel realizations in $\mathcal{H}$ and does not have bounded spectral norm; in fact, there are $|\mathcal{U}|$ eigenvalues of $\mathbf{Q}_{\mathcal{H}}$ that grow without bound as $N \to \infty$. This property affects the lower capacity bound in (35) of Theorem 3, where the conditional interference term now becomes
\[
\mathbb{E}\left\{(\mathbf{v}^{\mathrm{UL}})^H \mathbf{Q}_{\mathcal{H}} \mathbf{v}^{\mathrm{UL}} \,\middle|\, \tilde{\mathcal{H}}^{\mathrm{BS}}\right\} = p^{\mathrm{UE}} \sum_{l \in \mathcal{U}} \mathbb{E}\left\{|\mathbf{h}_l^H \mathbf{v}^{\mathrm{UL}}|^2 \,\middle|\, \tilde{\mathcal{H}}^{\mathrm{BS}}\right\}. \tag{60}
\]
The following theorem shows how interference terms of the type $\mathbb{E}\{|\mathbf{h}_l^H \mathbf{v}^{\mathrm{UL}}|^2 \mid \tilde{\mathcal{H}}^{\mathrm{BS}}\}$ in (60) behave as $N$ grows large.

Theorem 6.

Assume that no instantaneous CSI is utilized for decoding (i.e., $\tilde{\mathcal{H}}^{\mathrm{BS}} = \tilde{\mathcal{H}}^{\mathrm{UE}} = \emptyset$), interference is treated as noise, and the receive combining vector is $\mathbf{v} = \hat{\mathbf{h}}/\|\hat{\mathbf{h}}\|$. Under the interference model in (56)–(59), the terms in (60) are
\[
\mathbb{E}\left\{|\mathbf{h}_l^H \mathbf{v}|^2\right\} =
\begin{cases}
\mathbb{E}\left\{\dfrac{p^{\mathrm{UE}} (\mathrm{tr}(\mathbf{A}\mathbf{R}_l))^2}{\mathrm{tr}\left(\mathbf{A}(|d + \eta_t^{\mathrm{UE}}|^2 \mathbf{R} + \boldsymbol{\Psi})\mathbf{A}^H\right)}\right\} + \mathcal{O}(\sqrt{N}), & l \in \mathcal{U}_{\parallel}, \\[2ex]
\mathcal{O}(1), & l \in \mathcal{U}_{\perp},
\end{cases} \tag{61}
\]
where $\eta_t^{\mathrm{UE}}$ is stochastic and $\mathbf{A}$, $\boldsymbol{\Psi}$ are given in Theorem 4. The lower capacity bound in (44) is generalized by replacing the term $\mathcal{O}(1/N^{1-n})$ in the denominator by
\[
\mathbb{E}\left\{\frac{\mathrm{tr}(\mathbf{R} - \mathbf{C})}{\mathrm{tr}\left(\mathbf{A}(|d + \eta_t^{\mathrm{UE}}|^2 \mathbf{R} + \boldsymbol{\Psi})\mathbf{A}^H\right)}\right\} \sum_{l \in \mathcal{U}_{\parallel}} \left(\frac{\mathrm{tr}(\mathbf{A}\mathbf{R}_l)}{\mathrm{tr}(\mathbf{A}\mathbf{R})}\right)^2 + \mathcal{O}\left(\frac{1}{N}\right). \tag{62}
\]

Proof: The proof is given in Appendix C-F.

This theorem shows that the effective interference from a co-user depends strongly on whether it interfered with the pilot transmission of the UE under study or not. The interference from co-users in $\mathcal{U}_{\perp}$, which were silent when the UE under study sent its pilot, vanishes asymptotically since the user channels decorrelate with $N$. This is the classical type of interference and is called regular interference in this section. In contrast, the interference from co-users in $\mathcal{U}_{\parallel}$, which were active during the pilot transmission, remains and even scales with $N$. This is the very essence of pilot contaminated interference, and Theorem 6 generalizes previous results from [9]–[13] (among others) to non-ideal hardware. The explanation for the different behaviors in Theorem 6 is that the channel estimate $\hat{\mathbf{h}}$ used in the receive combining is independent of the co-user channels $\mathbf{h}_l$ for $l \in \mathcal{U}_{\perp}$, but correlated with $\mathbf{h}_l$ for $l \in \mathcal{U}_{\parallel}$ since these vectors appeared in the interference term (56) during pilot transmission. Note that Theorem 6 was derived using MRC, while minimum mean squared error (MMSE) receive combining is generally a better choice in multi-cell multi-user scenarios since it actively suppresses interference [12]. Nevertheless, the theorem establishes the baseline behavior: only the pilot contaminated interference may have a substantial impact when $N$ is large (if a judicious receive combining is used). The severity of the pilot contamination depends on how the sets $\mathcal{U}_{\perp}$ and $\mathcal{U}_{\parallel}$ are chosen [43].

B. Inter-User Interference in the Downlink

The downlink transmission can also suffer from pilot contamination, especially if the numbers of antennas at neighboring BSs also grow linearly with $N$. The conditional interference variance in the DL takes a similar form as in (58)–(60):
\[
I_{\mathcal{H}}^{\mathrm{UE}} = p^{\mathrm{BS}} \sum_{l \in \mathcal{U}} \mathbb{E}\left\{|\tilde{\mathbf{h}}_l^H \mathbf{v}_l^{\mathrm{DL}}|^2 \,\middle|\, \tilde{\mathcal{H}}^{\mathrm{UE}}\right\} \tag{63}
\]
where $\mathbf{v}_l^{\mathrm{DL}}$ is the beamforming vector for DL transmission to UE $l \in \mathcal{U}$ from its (arbitrary) serving BS and $\tilde{\mathbf{h}}_l$ is the channel from that BS to the UE under study. For brevity, we will not dive into the details since these require assumptions on the decision making at other BSs. The general behavior is, however, the same: UEs with parallel UL pilots cause non-vanishing interference to each other in the DL, while the impact of all other interfering DL transmissions vanishes as $N$ grows large.

C. Numerical Illustrations
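Before the full simulations, the two scaling regimes in Theorem 6 can be reproduced with a small Monte Carlo sketch. It makes simplifying assumptions that are not the setup used for the figures: ideal hardware, spatially uncorrelated unit-variance channels, unit-power pilots, and an unnormalized least-squares-style estimate (the received pilot signal itself) in place of the LMMSE estimator in Theorem 1. All function names are hypothetical.

```python
import math
import random


def complex_normal(rng):
    """Draw one sample from CN(0, 1)."""
    scale = math.sqrt(0.5)
    return complex(rng.gauss(0.0, scale), rng.gauss(0.0, scale))


def mean_interference_gain(n_antennas, shares_pilot, trials=200, seed=0):
    """Monte Carlo estimate of E{|h_l^H v|^2} with v = h_hat / ||h_hat||."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        h = [complex_normal(rng) for _ in range(n_antennas)]      # UE under study
        h_co = [complex_normal(rng) for _ in range(n_antennas)]   # interfering co-user
        noise = [complex_normal(rng) for _ in range(n_antennas)]
        # The co-user leaks into the estimate only if it shares the pilot.
        if shares_pilot:
            h_hat = [a + b + c for a, b, c in zip(h, h_co, noise)]
        else:
            h_hat = [a + c for a, c in zip(h, noise)]
        norm_sq = sum(abs(x) ** 2 for x in h_hat)
        inner = sum(x.conjugate() * y for x, y in zip(h_co, h_hat))
        total += abs(inner) ** 2 / norm_sq
    return total / trials


if __name__ == "__main__":
    for n in (50, 100, 200):
        print(n, mean_interference_gain(n, True), mean_interference_gain(n, False))
```

In this simplified model the gain of a co-user with an overlapping pilot grows roughly linearly with $N$, while the gain of a co-user with an orthogonal pilot stays of order one, matching the two cases $l \in \mathcal{U}_{\parallel}$ and $l \in \mathcal{U}_{\perp}$ in (61).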

The impact of inter-user interference and pilot contamination on multi-cell systems with non-ideal hardware is now studied numerically. We consider UL scenarios with spatially uncorrelated channels, define the average SNR as $p^{\mathrm{UE}}\,\mathrm{tr}(\mathbf{R})/(N\sigma^2)$, and let $T_{\mathrm{data}}^{\mathrm{UL}}/T_{\mathrm{coher}} = 0.$ be the fraction of UL data transmission.

In Fig. 12 we consider the two types of inter-user interference from Theorem 6: regular interference from a UE whose pilot is orthogonal to the UE under study and pilot contaminated interference from a UE with an overlapping pilot. We want to investigate how the achievable per-user spectral efficiency in massive MIMO systems depends on the strength of the pilot contaminated interference, thus we consider a scenario where we operate close to the asymptotic limits: the SNR is 20 dB and the number of antennas is set to $N = 200$ (see Fig. 6). We consider three levels of impairments: $\kappa_t^{\mathrm{UE}} = \kappa_r^{\mathrm{BS}} \in \{0, 0. , 0. \}$. The lower capacity bounds are shown without interference, with only pilot contaminated interference, and with both types of interference. The horizontal axis in Fig. 12 is the relative channel gain of the pilot contaminated interference (with respect to the useful channel).

[Fig. 12: Lower capacity bounds of a user that experiences pilot contaminated interference of varying strength and possibly regular inter-user interference that is 10 dB weaker than the useful channel. The interference drowns in the distortion noise if it is weaker than the level of impairments at the UE. Horizontal axis: relative channel gain of pilot contamination [dB]; vertical axis: uplink spectral efficiency [bit/channel use]. Curves: no interference; only pilot contamination; pilot contamination and regular interference (relative gain −10 dB); shown for three values of $\kappa_t^{\mathrm{UE}} = \kappa_r^{\mathrm{BS}}$, with the levels $10\log_{10}(\kappa_t^{\mathrm{UE}})$ marked.]

We make several observations. Firstly, the ideal hardware case is more sensitive to interference than the non-ideal hardware case. This is particularly evident when it comes to regular inter-user interference, which gives a much larger performance gap in the ideal case. With non-ideal hardware, the regular interference (from a channel that is only 10 dB weaker than the useful channel) has little impact. This is due to the large number of antennas, which decorrelate the user channels. Secondly, the figure shows that pilot contaminated interference has a negligible impact when it arrives over a channel that is much weaker than the useful channel, but there are breaking points where the degradation effect suddenly becomes immense. Interestingly, the breaking points are close to $10\log_{10}(\kappa_t^{\mathrm{UE}})$; that is, how much weaker the distortion noise caused by the UE is compared to the useful signal. This is very intuitive if we compare the size of the distortion term $\kappa_t^{\mathrm{UE}}\,\mathbb{E}\{|\varphi|^2\}$ in the lower capacity bound in (44) with the interference term in (62). This is formalized as follows.

Corollary 8.

The pilot contaminated interference is negligible, when $N$ grows large, if

$$\kappa^{\mathrm{UE}}_t \gg \sum_{l \in \mathcal{U}_{\parallel}} \left( \frac{\mathrm{tr}(\mathbf{A}\mathbf{R}_l)}{\mathrm{tr}(\mathbf{A}\mathbf{R})} \right)^{2}. \qquad (64)$$

This corollary shows that pilot contaminated interference drowns in the distortion noise under certain conditions, which are independent of the absolute SNRs but depend on relative SNR differences of the type $\mathrm{tr}(\mathbf{A}\mathbf{R}_l)/\mathrm{tr}(\mathbf{A}\mathbf{R})$. Since the distortion noise typically is […]–[…] dB weaker than the useful signal, the same is needed for the pilot contaminated interference to make its impact negligible. This is not a major obstacle in cellular deployments; the scheduler should simply allocate different pilots within each cell and to cell-edge users of neighboring cells. This can be achieved by the pilot allocation algorithm in [43], but also by simple predefined cell sectorization as illustrated next.

[Fig. 13 drawing. A 4 × 4 grid of cells (Cell 1 to Cell 16) surrounded by wrap-around copies. Labels: "Each cell: 1 BS and 6 UEs"; "400 meters"; Pilot 1 to Pilot 6.]

Fig. 13: Illustration of a multi-cell scenario consisting of 16 square cells with wrap-around to avoid edge effects. Each cell is 400 m × 400 m and contains 6 UEs equally spaced on a circle of radius 100 m.

Fig. 13 shows an illustration of the realistic multi-cell scenario that we use to validate Corollary 8. The setup consists of 16 square cells, each of size 400 m × 400 m. To avoid edge effects, we use wrap-around as illustrated in Fig. 13. For simplicity, six UEs are scheduled per cell using a simple angular sectorization technique; the UEs are equally spaced on a circle of radius 100 m. We assume that orthogonal pilots are allocated to the UEs in each cell, while the same pilots are reused across cells with the same pattern. The channel covariance matrices are identity matrices that are scaled by the channel attenuations, which are based on the 3GPP propagation model in [57]: the path loss is $[\ldots]/D^{[\ldots]}$, where $D$ is the distance in meters. The transmit powers are $p^{\mathrm{UE}} = 0.[\ldots]\ \mu$J/channel use and the noise variance is $\sigma^2 = 10^{-[\ldots]}\ \mu$J/channel use. This gives an SNR of 32 dB to the serving BS and 0–13 dB to the surrounding BSs.

Fig. 14 shows the average achievable rates (based on the lower capacity bounds) with MMSE receive combining, which exploits the estimated intra-cell channels to suppress intra-cell interference. We consider ideal hardware and hardware impairments with $\kappa^{\mathrm{UE}}_t = \kappa^{\mathrm{BS}}_r = 0.[\ldots]$. To illustrate the impact of pilot contamination, we compare the inter-cell pilot reuse pattern described above with the ideal case when all UEs are allocated unique pilots. We observe that pilot contamination has a substantial impact on the ideal hardware case, and the relative loss will continue to increase with $N$ since only the curve for unique pilots grows towards infinity. In contrast, there is almost no difference between the unique and reused pilots cases in the system with non-ideal hardware, particularly not when $N$ is large. This implies that pilot contamination might have a negligible impact on massive MIMO systems with hardware impairments, as also shown in Corollary 8. The explanation is that the distortion noise at the UE is the main limiting factor in the considered scenario, thus regular inter-user interference and pilot contamination drown in the distortions.

[Fig. 14 plot. Horizontal axis: $N$. Vertical axis: Average Spectral Efficiency per User [bit/channel use]. Curves: Ideal: Unique Pilots; Ideal: Inter-Cell Pilot Reuse; Non-ideal: Unique Pilots; Non-ideal: Inter-Cell Pilot Reuse.]

Fig. 14: Achievable UL spectral efficiency for an average user in the multi-cell scenario depicted in Fig. 13. Each UE has either a unique pilot signal or the same pilots are reused in every cell. Pilot contamination degrades the spectral efficiency under ideal hardware, while the impact on a system with hardware impairments ($\kappa^{\mathrm{UE}}_t = \kappa^{\mathrm{BS}}_r = 0.[\ldots]$) is negligible.

[Footnote 18] As an example, suppose $\mathbf{R} = \delta^{-[\ldots]}\mathbf{I}$ and $\mathbf{R}_l = \delta_l^{-[\ldots]}\mathbf{I}$, where […] is the path loss exponent and $\delta, \delta_l$ are the distances between the BS under study and the two users. The right-hand side of (64) becomes $(\mathrm{tr}(\mathbf{A}\mathbf{R}_l)/\mathrm{tr}(\mathbf{A}\mathbf{R}))^2 = (\delta_l/\delta)^{-[\ldots]}$, which is in the range −[…] to −[…] dB if UE $l$ is […]–[…] times further away from the BS than the UE under study. This is the case for most UEs in neighboring cells, but to be sure one can apply a fractional reuse pattern such that adjacent cells use different pilots. All interfering UEs will then be, at least, […] times further away from the BS than the UE under study.

Informally speaking, the distortion noise acts as a fog that prevents the BS from seeing distant interferers in the pilot transmission phase. The number of orthogonal pilot signals is limited by $T^{\mathrm{UL}}_{\mathrm{pilot}}$ within the sight of each BS, but can otherwise be reused freely. A simple location-based pilot allocation was sufficient for the multi-cell scenario depicted in Fig. 13, but in general we believe that cell center and cell edge UEs need to be treated differently. If one can afford a fractional reuse pattern, where adjacent cells never use the same pilots, then Corollary 8 will be satisfied in most cases; see Footnote 18.

Finally, we recall from Section IV-C that the gain of increasing the number of antennas beyond $N = 100$ was small in the single-user case with non-ideal hardware. More antennas can be used in multi-cell scenarios to suppress the regular interference. The convergence to the asymptotic limit is, however, much faster with non-ideal hardware, because it is sufficient to suppress the regular interference to a level below the distortion noise.

VII. REFINEMENTS OF THE SYSTEM MODEL AND THE POSSIBLE IMPLICATIONS

Using the system model defined in Section II, we have shown how the additive distortion noises from hardware impairments limit the estimation accuracy and channel capacities. The practical relevance of the system model has been verified experimentally in [15]–[17]. It can also be motivated theoretically when the impairment characteristics are static within each coherence period (e.g., due to the use of strong compensation algorithms). In this case, one can apply the Bussgang theorem, which shows that any nonlinear distortion function of a Gaussian signal can be reduced to an affine function where the signal is multiplied with an effective channel and corrupted by uncorrelated Gaussian noise [7]. This results in additive distortion noise similar to the one in our system model, but not identical. For analytic tractability, we assumed in the system model of Section II that the distortion noises are independent of the data signals (not only uncorrelated) and Gaussian distributed (even if the data signals are not); the same assumptions were made in [14]–[19]. If one considers an alternative model where these two assumptions are not made, then the lower capacity bounds in this paper will still hold (because the mutual information is always reduced by adding the two assumptions [35]). The upper bounds in Section IV would not hold without the two assumptions, and new upper bounds can only be derived if we impose alternative assumptions on the exact dependence between signal and distortion. In other words, the model in Section II is a tractable canonical approximation of communication systems with hardware imperfections, but it is not a perfect model of reality.

In general, the time-varying nature of hardware impairments cannot be completely mitigated, which also gives rise to multiplicative distortions that vary within each coherence period. Furthermore, the covariance matrices of the additive distortion noises, given in (3), (4), (7), and (8), can be refined in several ways. This section outlines some possible refinements of the system model in Section II and how each one is expected to affect the main results; the exact analysis is not straightforward and is left for future work. Most of these model refinements will further degrade the performance, thus the upper capacity bounds in Theorem 2 are typically still valid, while the lower capacity bounds need to be reduced.
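The Bussgang-type decomposition mentioned above can be illustrated numerically. The following is a minimal sketch, not the model used in this paper: it assumes a real-valued (rather than complex) Gaussian input and a hard clipper as a stand-in nonlinearity, and the names `clip`, `bussgang_decomposition`, and `sample_correlation` are hypothetical. It estimates the effective gain as the least-squares coefficient of the output on the input, so the residual distortion is uncorrelated with the input by construction.

```python
import math
import random


def clip(x, level):
    """Hard clipping, used here only as an example of a memoryless nonlinearity."""
    return max(-level, min(level, x))


def sample_correlation(a, b):
    """Sample estimate of E{a*b} for two equally long real sequences."""
    return sum(x * y for x, y in zip(a, b)) / len(a)


def bussgang_decomposition(samples, nonlinearity):
    """Write f(x) = alpha*x + distortion, with the distortion uncorrelated with x."""
    outputs = [nonlinearity(x) for x in samples]
    # Effective linear gain: alpha = E{x f(x)} / E{x^2}
    alpha = sample_correlation(samples, outputs) / sample_correlation(samples, samples)
    distortion = [y - alpha * x for x, y in zip(samples, outputs)]
    return alpha, distortion


if __name__ == "__main__":
    rng = random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in range(20000)]
    alpha, dist = bussgang_decomposition(x, lambda v: clip(v, 1.0))
    print("effective gain:", alpha)
    print("cross-correlation with input:", sample_correlation(x, dist))
    print("distortion power:", sample_correlation(dist, dist))
```

For a unit-variance Gaussian input clipped at level $A$, the effective gain should approach $\mathrm{erf}(A/\sqrt{2})$. The distortion is uncorrelated with the input but neither independent of it nor Gaussian, which is the gap between this decomposition and the assumptions of Section II.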

A. Power Loss

It is difficult to model the total emitted power under non-ideal hardware, because some distortions are created independently in the hardware, other distortions take their power from the useful signals, and some impairment sources (e.g., non-linearities) can even reduce the emitted power. In this paper, we have implicitly assumed that the compensation algorithms scale the total emitted power such that it equals $p^{\mathrm{BS}}(1 + \kappa^{\mathrm{BS}}_t)$ in the DL and $p^{\mathrm{UE}}(1 + \kappa^{\mathrm{UE}}_t)$ in the UL. This simplification creates a small bias when comparing systems with different levels of impairments, but the simulations in [20, Section 4.3] showed that this has a negligible impact on the spectral efficiencies. Nevertheless, it is important to note that although the distortion noise caused by the BS vanishes as $N \to \infty$, there remains a power loss of $\frac{\kappa^{\mathrm{BS}}_t}{1+\kappa^{\mathrm{BS}}_t}$ that should be taken into account when designing massive MIMO systems.

B. High-Power Scalings

The levels of impairments in the transmitter hardware, $\kappa^{\mathrm{BS}}_t$ and $\kappa^{\mathrm{UE}}_t$, were taken as constants in Sections II–IV. This is reasonable when operating within the dynamic/linear range of the respective power amplifiers. Outside these ranges, the proportionality coefficients increase rapidly with the transmit powers $p^{\mathrm{BS}}$ and $p^{\mathrm{UE}}$ due to non-linearities. This behavior was accurately modeled by polynomials in [19] and [58]; for example, $\kappa^{\mathrm{BS}}_t$ could have two terms: a constant term describing the low-power EVM and a term $(p^{\mathrm{BS}}/c)^q$, for some exponent $q$ describing the severity/order of the dominating non-linearity and a constant $c > 0$ that marks the end of the dynamic range [19]. Note that the distortion noise added by the low-noise amplifiers in radio receivers will typically not become worse with the received power, thus it is reasonable to let $\kappa^{\mathrm{BS}}_r$ and $\kappa^{\mathrm{UE}}_r$ be constants.

The consequence of having proportionality coefficients that scale with the transmit powers is that the distortion noise power increases faster than the signal power. Hence, the capacity and estimation accuracy are no longer monotonically increasing in $p^{\mathrm{BS}}$ and $p^{\mathrm{UE}}$ when using non-ideal hardware; these metrics are instead maximized at some finite transmit powers [58]. The reason that we took $\kappa^{\mathrm{BS}}_t$ and $\kappa^{\mathrm{UE}}_t$ as constants herein is that the high-power regime is not our main focus. Consequently, the high-power limits that we derived are optimistic and might not be achievable in practice; alternatively, they are the result of decreasing the propagation distance instead of increasing the actual emitted power. The results when $N \to \infty$ are however accurate since the total power (or at least the power per antenna) decreases with $N$; see the discussion in Section V.

C. Alternative Distortion Noise Distributions

The distortion covariance matrices in (3) and (8) are based on the assumption of independent distortion at the different BS antennas. This implies that the distortion noise has a different spatial signature than the useful signal, which is the reason why the detrimental impact of the distortion noise caused by the BS vanishes as $N \to \infty$. The underlying assumption is that the hardware chains of different antennas are decoupled. Nevertheless, there can exist cross-correlation since the same useful signal is transmitted/received over the array, thus making the hardware react similarly. Such correlation was predicted and characterized in [59] but is typically small. Thus, we believe that also in practice the distortion noise and useful signal have different spatial signatures as $N$ grows large.

The distortion noises were assumed to be Gaussian distributed (for any fixed channel realization), but this can also be relaxed. As can be seen in the appendices, the proofs rely on the fact that the cross-moments between the signal and the distortion are weak. The independence assumption can probably be replaced by uncorrelatedness together with sufficiently weak higher-order moments, but the corresponding generalized proofs will be rather tedious and the convergence as $N \to \infty$ might be slower.

D. Multiplicative Distortions

The additive distortion model in this paper has been verified experimentally for systems that apply compensation algorithms to mitigate the main hardware impairments. It is also an accurate model for uncompensated inter-carrier interference caused by phase noise and I/Q imbalance, amplitude-amplitude nonlinearities in power amplifiers, and quantization errors [14], [31], [60]. As described in the beginning of Section VII, hardware impairments also cause channel attenuations and phase shifts that are multiplied with the channel vector $\mathbf{h}$. If these multiplicative distortions are sufficiently static (after compensation), they can be included in the channel vector $\mathbf{h}$ by an appropriate scaling of the covariance matrix $\mathbf{R}$ or by exploiting that the channel distribution is circularly symmetric. However, phase noise is a prime example of an impairment that causes multiplicative distortions that drift and accumulate within the channel coherence period [25], [61]–[63]. We now take a closer look at this type of distortion, to investigate in which ways it behaves differently from additive distortion noise.

The actual channel under phase noise can be described as $\mathrm{diag}(e^{\jmath\phi_{1,t}}, \ldots, e^{\jmath\phi_{N,t}})\,\mathbf{h}$, where $\jmath = \sqrt{-1}$ is the imaginary unit and $\{\phi_{i,t}\}$ is the stochastic process at the $i$th channel element. The phase drift of free-running oscillators is commonly modeled as a Wiener process

$$\phi_{i,t} = \phi_{i,t-1} + \theta^{\mathrm{BS}}_{i,t} + \theta^{\mathrm{UE}}_{t} \quad \forall i \qquad (65)$$

where the initial value is $\phi_{i,0} = 0$ since $t = 0$ denotes the time of the channel estimation. The innovations that occur $t$ channel uses after the channel estimation are $\theta^{\mathrm{BS}}_{i,t} \sim \mathcal{N}(0, \Delta_{\mathrm{BS}})$ and $\theta^{\mathrm{UE}}_{t} \sim \mathcal{N}(0, \Delta_{\mathrm{UE}})$ at the BS and UE, respectively. Note that the single-antenna UE's hardware causes identical drifts on all channel elements, while the BS can cause identical or independent drifts depending on the use of a common oscillator (CO) or separate oscillators (SOs) at each antenna element.
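The recursion in (65) can be simulated directly. The sketch below is only illustrative: the function names and the numerical values of $\Delta_{\mathrm{BS}}$, $\Delta_{\mathrm{UE}}$, and $t$ are assumptions chosen for the example, not values from this paper. It tracks the drift at a single channel element and estimates the variance of $\phi_{i,t}$, which for a sum of independent innovations should be close to $t(\Delta_{\mathrm{BS}} + \Delta_{\mathrm{UE}})$.

```python
import math
import random


def simulate_phase_drift(t_max, delta_bs, delta_ue, rng):
    """Return [phi_0, ..., phi_t_max] following phi_t = phi_{t-1} + theta_BS + theta_UE."""
    phi = [0.0]  # phi_0 = 0 at the time of channel estimation
    for _ in range(t_max):
        theta_bs = rng.gauss(0.0, math.sqrt(delta_bs))  # BS innovation, variance delta_bs
        theta_ue = rng.gauss(0.0, math.sqrt(delta_ue))  # UE innovation, variance delta_ue
        phi.append(phi[-1] + theta_bs + theta_ue)
    return phi


def empirical_variance_at(t, delta_bs, delta_ue, trials, seed=0):
    """Monte Carlo estimate of Var{phi_t} over independent drift realizations."""
    rng = random.Random(seed)
    finals = [simulate_phase_drift(t, delta_bs, delta_ue, rng)[-1] for _ in range(trials)]
    mean = sum(finals) / trials
    return sum((p - mean) ** 2 for p in finals) / trials


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    t, delta_bs, delta_ue = 50, 1e-4, 2e-4
    print("empirical variance:", empirical_variance_at(t, delta_bs, delta_ue, trials=4000))
    print("t * (delta_bs + delta_ue):", t * (delta_bs + delta_ue))
```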
The phase drifts are temporally white, thus $\phi_{i,t} \sim \mathcal{N}(0, t\Delta_{\mathrm{BS}} + t\Delta_{\mathrm{UE}})$, which shows that the variance increases with time.

To comprehend the impact of phase noise, we note that the signal part of the received UL signal under MRC, $\mathbf{v} = \hat{\mathbf{h}}/\|\hat{\mathbf{h}}\|$, is

$$\mathbf{v}^H \mathrm{diag}(e^{\jmath\phi_{1,t}}, \ldots, e^{\jmath\phi_{N,t}})\,\mathbf{h}\,d \approx \underbrace{\mathbf{v}^H \mathbf{h}\,d}_{\text{Ideal signal}} + \underbrace{\jmath\,\mathbf{v}^H \mathrm{diag}(\phi_{1,t}, \ldots, \phi_{N,t})\,\mathbf{h}\,d}_{\text{Distortion from phase noise}} \qquad (66)$$

using the Taylor approximation $e^{\jmath\phi_{i,t}} \approx 1 + \jmath\phi_{i,t}$ because the drifts are small [64], [65]. The first term in (66) is the same as without phase noise, while the second term characterizes the mismatch from the phase drift. Since $\phi_{i,t}$ has zero mean, the two terms are uncorrelated (irrespective of whether $\mathbf{h}$ and $d$ are deterministic or stochastic). We can therefore obtain a lower bound on the mutual information by treating the uncorrelated second term of (66) as independent Gaussian noise [35, Theorem 1]. By taking the average over channel realizations, data signals, and phase drifts, the variance of this distortion is

$$\mathbb{E}\left\{ \left| \jmath\,\mathbf{v}^H \mathrm{diag}(\phi_{1,t}, \ldots, \phi_{N,t})\,\mathbf{h}\,d \right|^2 \right\} = \sum_{i_1=1}^{N} \sum_{i_2=1}^{N} \mathbb{E}\{ v_{i_1}^{*} v_{i_2} h_{i_1} h_{i_2}^{*} \}\, \mathbb{E}\{ \phi_{i_1,t}\, \phi_{i_2,t}^{*} \}\, \mathbb{E}\{ |d|^2 \}$$
$$= p^{\mathrm{UE}}\, \mathbb{E}\{ |\mathbf{v}^H \mathbf{h}|^2 \}\, t\Delta_{\mathrm{UE}} + \begin{cases} p^{\mathrm{UE}}\, t\Delta_{\mathrm{BS}}\, \mathbb{E}\{ |\mathbf{v}^H \mathbf{h}|^2 \}, & \text{if CO}, \\ p^{\mathrm{UE}}\, t\Delta_{\mathrm{BS}} \sum_{i=1}^{N} \mathbb{E}\{ |h_i|^2 |v_i|^2 \}, & \text{if SO}, \end{cases} \qquad (67)$$

where the first term originates from the UE and the second term is due to the BS having either a CO or SOs at each antenna. Recall from Theorem 4 that $\mathbb{E}\{|\mathbf{v}^H \mathbf{h}|^2\} = \mathcal{O}(N)$, while the sum $\sum_{i=1}^{N} \mathbb{E}\{|h_i|^2 |v_i|^2\} = \mathcal{O}(1)$ because $\|\mathbf{v}\| = 1$. This means that a BS with a CO causes distortion that scales as $\mathcal{O}(tN)$, while it only scales as $\mathcal{O}(t)$ when having SOs.

[Footnote] Since the variance of $\phi_{i,t}$ increases linearly with $t$, the Taylor approximation is only valid for a certain time. The time dependence can however be mitigated by tracking the phase noise within each coherence period [64].
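The different scalings of the BS term in (67) can be checked with a small Monte Carlo experiment. This is a sketch under simplifying assumptions that are not part of the analysis above: perfect channel knowledge (so $\mathbf{v} = \mathbf{h}/\|\mathbf{h}\|$), $\mathbf{R} = \mathbf{I}$, unit data power, and BS phase drift only. The function names and parameter values are hypothetical.

```python
import math
import random


def complex_gaussian_vector(n, rng):
    """Draw n i.i.d. CN(0, 1) entries."""
    s = math.sqrt(0.5)
    return [complex(rng.gauss(0.0, s), rng.gauss(0.0, s)) for _ in range(n)]


def distortion_power(n, t, delta_bs, common_oscillator, trials, seed=0):
    """Average of |v^H diag(phi) h|^2 with v = h/||h|| and BS phase drift only."""
    rng = random.Random(seed)
    sigma = math.sqrt(t * delta_bs)  # standard deviation of the drift after t channel uses
    total = 0.0
    for _ in range(trials):
        h = complex_gaussian_vector(n, rng)
        norm = math.sqrt(sum(abs(x) ** 2 for x in h))
        if common_oscillator:
            phi = [rng.gauss(0.0, sigma)] * n  # one drift shared by all antennas (CO)
        else:
            phi = [rng.gauss(0.0, sigma) for _ in range(n)]  # independent drifts (SOs)
        term = sum((hi.conjugate() / norm) * p * hi for hi, p in zip(h, phi))
        total += abs(term) ** 2
    return total / trials


if __name__ == "__main__":
    # Hypothetical values: 64 antennas, 10 channel uses after estimation.
    n, t, delta = 64, 10, 1e-4
    co = distortion_power(n, t, delta, True, trials=1000)
    so = distortion_power(n, t, delta, False, trials=1000)
    print("CO distortion power:", co, "(compare with N*t*delta =", n * t * delta, ")")
    print("SO distortion power:", so, "(compare with t*delta =", t * delta, ")")
```

Under these assumptions the CO case grows proportionally to $N t \Delta_{\mathrm{BS}}$, while the SO case stays within a small constant factor of $t \Delta_{\mathrm{BS}}$ regardless of $N$.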
In other words, it appears to be preferable to have independent oscillators at each antenna element in massive MIMO systems, which was also noted in [25]. This is also consistent with our results for additive distortion noise: impairments at the UE are $N$ times more influential on the capacity, thus we can degrade the quality of the BS's oscillators with $N$ and only get a minor loss in performance. This property is also positive for distributed massive MIMO deployments where the antenna separation is large and prevents the use of a CO. A major difference from the additive distortion noise in Section II is that the distortion in (67) increases linearly with the time $t$, thus it eventually grows large and it becomes necessary to send pilot/calibration signals more often to mitigate it [25].

The narrowband phase-noise analysis above only considered a lower capacity bound, thus it is possible to achieve higher rates. In particular, the analysis assumed uncompensated free-running oscillators, while it might be better to track the phase noise process at the BS; for example, by using previous received signals, extra calibration signals (see [64] and references therein), and utilizing correlation between subcarriers in multi-carrier systems. The tracking might be more accurate for a CO since there are $\mathcal{O}(N)$ observations of a single phase drift parameter, instead of $\mathcal{O}(N)$ observations for $N$ parameters as with SOs. Another important aspect of phase noise is that the standard deviations $\sqrt{\Delta_{\mathrm{BS}}}$ and $\sqrt{\Delta_{\mathrm{UE}}}$ are typically proportional to the carrier frequency [62], thus phase noise might be a major challenge in higher frequency bands (e.g., mmWave) [66], unless the symbol time is also sufficiently reduced by increasing the bandwidth.

E. Imperfect Channel Reciprocity

The downlink beamforming in massive MIMO TDD systems relies on channel reciprocity; that is, if $\mathbf{h}$ is the uplink channel then $\mathbf{h}^T$ is the downlink channel. This property holds for the radio-frequency propagation channels, but the end-to-end channels are also affected by the hardware since different transceiver chains are used for transmission/reception at the BS and the UE. The actual downlink channel is $\mathbf{h}^T \mathbf{D}_b$, where the diagonal matrix $\mathbf{D}_b = \mathrm{diag}(b_1, \ldots, b_N)$ contains $N$ calibration parameters. These are $b_i = 1\ \forall i$ for ideal hardware but we generally have $b_i \neq 1\ \forall i$ due to non-ideal hardware. The mismatch is fully specified by $b_1, \ldots, b_N$ and fortunately these parameters change slowly with time, thus one can compute estimates $\hat{b}_1, \ldots, \hat{b}_N$ using a negligible amount of overhead signaling [16] (even in massive MIMO systems [67]). Since the transmit beamforming mainly depends on the channel direction, it is often sufficient for the BS to compute the downlink channel up to an unknown scaling factor; see [16], [67]–[69] for different techniques that exploit uplink pilot transmissions. The estimates are naturally imperfect, thus $b_i = c(\hat{b}_i + e_i)$, where $e_i$ is the estimation error and $c$ is the unknown common scaling factor.

[Footnote] Since the useful signal power $p^{\mathrm{UE}}\,\mathbb{E}\{|\mathbf{v}^H \mathbf{h}|^2\}$ also scales as $\mathcal{O}(N)$, the relative distortion power behaves as $\mathcal{O}(t)$ with CO and $\mathcal{O}(t/N)$ with SOs.

Imperfect channel reciprocity has no impact on the UL and is not expected to change anything fundamentally in the DL. There is a loss in received signal power since the beamforming direction is perturbed, but there is no extra self-interference since all the CSI available at the receiving UE is estimated in the downlink and thus reflects the actual downlink channel $\mathbf{h}^T \mathbf{D}_b$. In other words, the lower capacity bound in (34) is still valid if we replace $\mathbf{h}$ by $\mathbf{D}_b \mathbf{h}$ everywhere and compute the expectations with respect to the actual distributions.
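The loss in received signal power from a perturbed beamforming direction can be illustrated with a short simulation. This is a rough sketch under assumptions that go beyond the text: the BS knows $\mathbf{h}$ perfectly (so that only the calibration error matters), the scaling factor is $c = 1$, maximum ratio transmission is used, and the true calibration parameters are drawn as $1$ plus a small complex Gaussian perturbation. All names and numerical values are hypothetical.

```python
import math
import random


def cgauss(rng, variance):
    """Draw one circularly symmetric complex Gaussian sample with the given variance."""
    s = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, s), rng.gauss(0.0, s))


def relative_beamforming_gain(n, error_variance, rng):
    """Received power with w built from b_hat, relative to w built from the true b."""
    h = [cgauss(rng, 1.0) for _ in range(n)]
    b = [1.0 + cgauss(rng, 0.05) for _ in range(n)]          # assumed true calibration parameters
    b_hat = [bi - cgauss(rng, error_variance) for bi in b]   # b_i = b_hat_i + e_i (with c = 1)
    g = [bi * hi for bi, hi in zip(b, h)]          # entries of the actual DL channel D_b h
    g_hat = [bi * hi for bi, hi in zip(b_hat, h)]  # what the BS believes the DL channel is
    # Maximum ratio transmission based on the believed channel: w = conj(g_hat)/||g_hat||
    norm_hat = math.sqrt(sum(abs(x) ** 2 for x in g_hat))
    received = sum(gi * ghi.conjugate() for gi, ghi in zip(g, g_hat)) / norm_hat
    best_power = sum(abs(x) ** 2 for x in g)  # power achieved with w = conj(g)/||g||
    return abs(received) ** 2 / best_power


if __name__ == "__main__":
    rng = random.Random(0)
    for var in (0.0, 0.01, 0.1):
        print("error variance", var, "-> relative gain", relative_beamforming_gain(100, var, rng))
```

By the Cauchy-Schwarz inequality the ratio never exceeds one, and it approaches one as the calibration errors $e_i$ shrink.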
The beamforming vector $\mathbf{v}^{\mathrm{DL}}$ is now a function of $\mathrm{diag}(\hat{b}_1, \ldots, \hat{b}_N)\,\hat{\mathbf{h}}$. This perturbation of $\mathbf{v}^{\mathrm{DL}}$, as compared to having perfect reciprocity, behaves like a channel estimation error and its impact is expected to vanish as $N$ grows large. Moreover, it should only have a minor impact on the inter-user interference in multi-cell scenarios, since the reciprocity calibration errors are independent of the co-user channels.

VIII. CONCLUSION

This paper analyzed the capacity and estimation accuracy of massive MIMO systems with non-ideal transceiver hardware. The analysis was based on a new system model that models the hardware impairment at each antenna by an additive distortion noise that is proportional to the signal power at this antenna. This model has several attractive features: it is mathematically tractable, it has been verified experimentally in previous works, and it can be motivated theoretically in systems that apply compensation algorithms to mitigate the hardware impairments.

We proved analytically that hardware impairments create non-zero estimation error floors and finite capacity ceilings in the uplink and downlink, irrespective of the SNR and the number of base station antennas $N$. This stands in contrast to the very optimistic asymptotic results previously reported for ideal hardware. Despite these discouraging results, we showed that massive MIMO systems can still achieve a huge array gain, in the sense that relatively high spectral efficiency and energy efficiency can be obtained. Furthermore, we proved that only the hardware impairments at the UEs limit the capacities as $N$ grows large. This implies that the hardware quality at the BS can be decreased as $N$ grows, which is an important insight and might become a key enabler for future network deployments.

In multi-cell scenarios, we proved that the detrimental effect of inter-user interference and pilot contamination drowns in the distortion noise if a simple pilot allocation algorithm is used to avoid the strongest forms of pilot contaminated interference. Many quantitative conclusions can be drawn from the numerical results in Sections III–VI; for example, that there is little gain in having more than 100 antennas for a single-user link, but additional antennas are useful to suppress inter-user interference in multi-cell scenarios. The asymptotic limits under non-ideal hardware are generally reached with far fewer antennas than the asymptotic limits for ideal hardware, which implies that we can expect practical systems to benefit from the asymptotic results. We also gave a brief description of how the system model considered in this paper can be refined to model hardware impairments in even greater detail and how such refinements would affect our results.

APPENDIX A
NEW AND OLD RESULTS ON RANDOM VECTORS

Lemma 2. [70, Eq. (2.2)] For invertible matrices $\mathbf{B}$ and $\tau \geq 0$, it holds that

$$(\mathbf{B} + \tau \mathbf{x}\mathbf{x}^H)^{-1}\mathbf{x} = \frac{\mathbf{B}^{-1}\mathbf{x}}{1 + \tau\,\mathbf{x}^H \mathbf{B}^{-1}\mathbf{x}}. \qquad (68)$$

Lemma 3. Suppose $h \sim \mathcal{CN}(0, r)$ and $a, b > 0$, then

$$\mathbb{E}\left\{ \frac{|h|^2}{a|h|^2 + b} \right\} = \frac{1}{a}\left( 1 - \frac{b}{ar}\, E_1\!\left(\frac{b}{ar}\right) e^{\frac{b}{ar}} \right) \qquad (69)$$

where $E_1(x) = \int_1^{\infty} \frac{e^{-tx}}{t}\, dt$ denotes the exponential integral.

Proof: Since $\varrho = |h|^2$ has the exponential distribution with mean value $r$, the expectation in (69) equals

$$\int_0^{\infty} \frac{\varrho}{a\varrho + b}\, \frac{e^{-\varrho/r}}{r}\, d\varrho = \frac{b}{a^2 r}\, e^{\frac{b}{ar}} \int_1^{\infty} \left( 1 - \frac{1}{x} \right) e^{-\frac{xb}{ar}}\, dx \qquad (70)$$

where the equality follows from the change of variable $x = \frac{a}{b}\varrho + 1$. Straightforward integration and identification of the exponential integral yield the right-hand side of (69).

Lemma 4. For any $a, b \in \mathbb{C}$ and non-zero $c, d \in \mathbb{C}$, we have

$$\left| \frac{a}{c} - \frac{b}{d} \right| \leq \frac{|b|\,|c - d|}{|c|\,|d|} + \frac{|a - b|}{|c|}. \qquad (71)$$

Proof: This is straightforward to prove by using that $|ad - bc| = |ad - bc + bd - bd| \leq |b|\,|c - d| + |d|\,|a - b|$.

Lemma 5. Consider $M$ arbitrary matrices $\mathbf{M}_1, \ldots, \mathbf{M}_M \in \mathbb{C}^{N \times N}$ and a Hermitian positive semi-definite matrix $\mathbf{B} \in \mathbb{C}^{N \times N}$. It follows that

$$|\mathrm{tr}(\mathbf{M}_1 \cdots \mathbf{M}_M \mathbf{B})| \leq \mathrm{tr}(\mathbf{B}) \prod_{i=1}^{M} \|\mathbf{M}_i\|_2. \qquad (72)$$

If $\mathbf{M}_1, \ldots, \mathbf{M}_M, \mathbf{B}$ have uniformly bounded spectral norms, then

$$|\mathrm{tr}(\mathbf{M}_1 \cdots \mathbf{M}_M \mathbf{B})| = \mathcal{O}(N). \qquad (73)$$

Proof: The bound in (72) follows from the fact that $\mathbf{B}$ has non-negative eigenvalues and each matrix $\mathbf{M}_i$ cannot amplify these by more than $\|\mathbf{M}_i\|_2$. The $\mathcal{O}(N)$-scaling follows from (72) by using the assumptions and $\mathrm{tr}(\mathbf{B}) \leq N\|\mathbf{B}\|_2$.

Lemma 6. [71, Lemma B.26] Let $\mathbf{B} \in \mathbb{C}^{N \times N}$ be deterministic and $\mathbf{x} = [x_1 \ldots x_N]^T \in \mathbb{C}^N$ be a stochastic vector of independent entries. Assume that $\mathbb{E}\{x_i\} = 0$, $\mathbb{E}\{|x_i|^2\} = 1$, and $\mathbb{E}\{|x_i|^{\ell}\} = \chi_{\ell} < \infty$ for $\ell \leq 2q$. Then, for any $q \geq 1$,

$$\mathbb{E}\left\{ \left| \mathbf{x}^H \mathbf{B} \mathbf{x} - \mathrm{tr}(\mathbf{B}) \right|^q \right\} \leq C_q \left( \mathrm{tr}(\mathbf{B}\mathbf{B}^H) \right)^{q/2} \left( \chi_4^{q/2} + \chi_{2q} \right) \qquad (74)$$

where $C_q$ is a constant depending on $q$ only.

APPENDIX B
APPLICATION-RELATED RANDOM VECTOR RESULTS

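Lemma 7. The channel estimate $\hat{\mathbf{h}}$ can be decomposed as

$$\hat{\mathbf{h}} = \mathbf{A}\left( \left( (d + \eta^{\mathrm{UE}}_t)\mathbf{I} + \mathbf{D}_r \right)\mathbf{h} + \boldsymbol{\nu} \right) \qquad (75)$$

where $\mathbf{A}$ is defined in (9) and the diagonal matrix $\mathbf{D}_r$ has independent $\mathcal{CN}(0, \kappa^{\mathrm{BS}}_r p^{\mathrm{UE}})$-entries such that $\boldsymbol{\eta}^{\mathrm{BS}}_r = \mathbf{D}_r \mathbf{h}$. For any realizations of $\eta^{\mathrm{UE}}_t$ and $\mathbf{D}_r$, the conditional distribution is

$$\hat{\mathbf{h}}\,|\,\eta^{\mathrm{UE}}_t, \mathbf{D}_r \sim \mathcal{CN}\left( \mathbf{0},\ \mathbf{A}(\boldsymbol{\Phi} + \mathbf{S} + \sigma^2 \mathbf{I})\mathbf{A}^H \right) \qquad (76)$$

where $\boldsymbol{\Phi} = ((d + \eta^{\mathrm{UE}}_t)\mathbf{I} + \mathbf{D}_r)\,\mathbf{R}\,((d + \eta^{\mathrm{UE}}_t)\mathbf{I} + \mathbf{D}_r)^H$.

Proof: This characterization follows directly from Theorem 1 and the system model defined in Section II.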

Lemma 8.

For the channel h and its estimate ˆ h it holds that E (cid:26)(cid:12)(cid:12)(cid:12) h H ˆ h − (1 + d − η UE t )tr( R − C ) (cid:12)(cid:12)(cid:12) (cid:27) = O ( N ) (77) E (cid:26)(cid:12)(cid:12)(cid:12) | h H ˆ h | − | d − η UE t | tr( R − C ) (cid:12)(cid:12)(cid:12) (cid:27) = O ( N ) . (78) Proof:

Recall that ˆ h = A (cid:0) h ( d + η UE t ) + ν + η BS r (cid:1) for A = d ∗ R ¯ Z − . To prove (77), we expand the argument as (cid:12)(cid:12)(cid:12) h H ˆ h − (1 + d − η UE t )tr( R − C ) (cid:12)(cid:12)(cid:12) ≤ | h H A ν | (79) + 4 | h H A η BS r | + 2 | d + η UE t | (cid:12)(cid:12) h H Ah − d − tr( R − C ) (cid:12)(cid:12) by using the rule | a + b | q ≤ q − ( | a | q + | b | q ) (from H¨older’sinequality) twice. Next, we observe that E {| h H A ν | } = tr (cid:0) A ( S + σ I ) A H R (cid:1) = O ( N ) (80) E {| h H A η BS r | } = κ BS r p UE tr (cid:0) AR diag A H R (cid:1) + κ BS r p UE N (cid:88) i =1 | e Hi RAe i | = O ( N ) (81)where e i is the i th column of an N × N identity matrix. Theexpression (80) follows from the independence of h , ν and(81) follows by straightforward computation using the char-acterization η BS r = D r h in Lemma 7. The O ( N ) -propertiesfollows from Lemma 5 since R , S , A have uniformly boundedspectral norms (by assumption).Since h ∼ R / ˜ h for ˜ h ∼ CN ( , I ) and d − tr( R − C ) =tr( R / AR / ) we can apply Lemma 6 to obtain E (cid:110) | d + η UE t | (cid:12)(cid:12) h H Ah − d − tr( R − C ) (cid:12)(cid:12) (cid:111) ≤ (1 + κ UE t )2 χ C tr (cid:0) ( R − C ) (cid:1) = O ( N ) . (82) We obtain (77) by combining (79)–(82). Expression (78)follows directly, since it is upper bounded similarly to (79).

Lemma 9.

For the channel h and its estimate ˆ h it holds that E (cid:26)(cid:12)(cid:12)(cid:12) (cid:107) ˆ h (cid:107) − tr (cid:16) A ( | d + η UE t | R + Ψ ) A H (cid:17)(cid:12)(cid:12)(cid:12) (cid:27) = O ( N ) (83) E (cid:40)(cid:12)(cid:12)(cid:12) (cid:107) ˆ h (cid:107) − (cid:114) tr (cid:16) A ( | d + η UE t | R + Ψ ) A H (cid:17)(cid:12)(cid:12)(cid:12) (cid:41) = O (1) (84) where A is deﬁned in (9) and Ψ = p UE κ BS r R diag + S + σ I .Proof: By injecting the term tr( A ( Φ + S + σ I ) A H ) thatappeared in Lemma 7 and using the rule | a + b | q ≤ q − ( | a | q + | b | q ) (from H¨older’s inequality), we bound the left-hand sideof (83) as E (cid:26)(cid:12)(cid:12)(cid:12) (cid:107) ˆ h (cid:107) − tr (cid:16) A ( | d + η UE t | R + Ψ ) A H (cid:17)(cid:12)(cid:12)(cid:12) (cid:27) ≤ E (cid:26)(cid:12)(cid:12)(cid:12) (cid:107) ˆ h (cid:107) − tr (cid:16) A ( Φ + S + σ I ) A H (cid:17)(cid:12)(cid:12)(cid:12) (cid:27) + 2 E (cid:26)(cid:12)(cid:12)(cid:12) tr (cid:16) A ( Φ + S + σ I ) A H (cid:17) − tr (cid:16) A ( | d + η UE t | R + Ψ ) A H (cid:17)(cid:12)(cid:12)(cid:12) (cid:27) . (85)The ﬁrst term in (85) satisﬁes E (cid:26)(cid:12)(cid:12)(cid:12) (cid:107) ˆ h (cid:107) − tr( A ( Φ + S + σ I ) A H ) (cid:12)(cid:12)(cid:12) (cid:27) ≤ C E (cid:8) tr( A ( Φ + S + σ I ) A H A ( Φ + S + σ I ) H A H ) (cid:9) ≤ C (cid:107) A (cid:107) E (cid:8) (cid:107) Φ + S + σ I (cid:107) F (cid:9) = O ( N ) (86)where the ﬁrst inequality follows from applying Lemma 6 on(83) for ﬁxed η UE t , D r (note that the fourth-order moment is χ = 2 for complex Gaussian variables), while the secondinequality follows from applying Lemma 5 twice. 
The scaling O ( N ) follows since σ is constant, (cid:107) A (cid:107) = O (1) , (cid:107) S (cid:107) F = O ( N ) , E { tr( ΦS ) } ≤ (cid:107) R (cid:107) (cid:107) S (cid:107) E {(cid:107) ( d + η UE t ) I + D r (cid:107) F } = O ( N ) , and E { tr( ΦΦ H ) } ≤ (cid:107) R (cid:107) E {(cid:107) (( d + η UE t ) I + D r )(( d + η UE t ) I + D r ) H (cid:107) F } = O ( N ) using Lemma 5.Next, we characterize the second term in (85) as E (cid:26)(cid:12)(cid:12)(cid:12) tr (cid:16) A ( Φ + S + σ I ) A H (cid:17) − tr (cid:16) A ( | d + η UE t | R + Ψ ) A H (cid:17)(cid:12)(cid:12)(cid:12) (cid:27) = E (cid:40)(cid:12)(cid:12)(cid:12)(cid:12) tr (cid:16) A (cid:0) D r RD Hr + D r R ( d + η UE t ) ∗ + ( d + η UE t ) RD Hr (cid:1) A H (cid:17) − p UE κ BS r tr (cid:16) AR diag A H (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) (cid:41) ≤ E (cid:26)(cid:12)(cid:12)(cid:12) tr (cid:16) A ( D r RD Hr − p UE κ BS r R diag ) A H (cid:17)(cid:12)(cid:12)(cid:12) (cid:27) + 4 E (cid:8) | d + η UE t | (cid:9) E (cid:26)(cid:12)(cid:12)(cid:12) tr (cid:16) AD r RA H (cid:17)(cid:12)(cid:12)(cid:12) (cid:27) = O ( N ) . (87) where the equality follows from plugging in the ex-pressions for Φ and Ψ and noting that the terms | d + η UE t | tr( ARA H ) , tr( ASA H ) , and σ tr( AA H ) can-cel out. The inequality follows again from the rule | a + b | q ≤ q − ( | a | q + | b | q ) . The O ( N ) -scaling fol-lows since the ﬁrst term in (87) is upper boundedby ( p UE κ BS r ) tr( AR diag A H AR diag A H ) = O ( N ) usingLemma 6 and some algebra, while E {| tr( AD r RA H ) | } ≤(cid:107) RA H A (cid:107) E {| tr( D r ) | } = O ( N ) using Lemma 5. 
The expression (83) now follows from combining (85)–(87).

Finally, the expression (84) is proved as

$$\begin{aligned}
&\mathbb{E}\Big\{\Big|\|\hat{\mathbf{h}}\|_2 - \sqrt{\mathrm{tr}\big(\mathbf{A}(|d+\eta_t^{\mathrm{UE}}|^2\mathbf{R}+\boldsymbol{\Psi})\mathbf{A}^H\big)}\Big|^2\Big\} \\
&\quad\le \mathbb{E}\left\{\frac{\big|\|\hat{\mathbf{h}}\|_2^2 - \mathrm{tr}\big(\mathbf{A}(|d+\eta_t^{\mathrm{UE}}|^2\mathbf{R}+\boldsymbol{\Psi})\mathbf{A}^H\big)\big|^2}{\Big|\|\hat{\mathbf{h}}\|_2 + \sqrt{\mathrm{tr}\big(\mathbf{A}(|d+\eta_t^{\mathrm{UE}}|^2\mathbf{R}+\boldsymbol{\Psi})\mathbf{A}^H\big)}\Big|^2}\right\} \\
&\quad\le \frac{\mathbb{E}\Big\{\big|\|\hat{\mathbf{h}}\|_2^2 - \mathrm{tr}\big(\mathbf{A}(|d+\eta_t^{\mathrm{UE}}|^2\mathbf{R}+\boldsymbol{\Psi})\mathbf{A}^H\big)\big|^2\Big\}}{\mathrm{tr}\big(\mathbf{A}\boldsymbol{\Psi}\mathbf{A}^H\big)} = O(1)
\end{aligned} \tag{88}$$

where the first inequality follows from the rule $(a-b)(a+b) = a^2-b^2$ and the second inequality is due to removal of non-zero terms from the denominator. The numerator scales as $O(N)$ and the denominator scales at least linearly with $N$ because

$$\mathrm{tr}\big(\mathbf{A}\boldsymbol{\Psi}\mathbf{A}^H\big) \ge \lambda_{\min}(\boldsymbol{\Psi})\,\mathrm{tr}\big(\mathbf{A}\mathbf{A}^H\big) \ge p^{\mathrm{UE}}\frac{\lambda_{\min}(\boldsymbol{\Psi})}{\lambda_{\max}(\bar{\mathbf{Z}})^2}\,\mathrm{tr}\big(\mathbf{R}\mathbf{R}^H\big),$$

where $\mathrm{tr}(\mathbf{R})$ grows linearly with $N$ (by assumption). Here, $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ denote the largest and smallest eigenvalues of a matrix, respectively. This shows that (84) is bounded and finalizes the proof.

Lemma 10. For the estimated channel $\hat{\mathbf{h}}$ in (9) it holds that

$$\mathbb{E}\left\{\frac{|1+d^{-1}\eta_t^{\mathrm{UE}}|^k}{\|\hat{\mathbf{h}}\|_2^2}\right\} \le \frac{2^k + 2^k\big(\tfrac{k}{2}\big)!\,(\kappa_t^{\mathrm{UE}})^{k/2}}{\lambda_{\min}^{+}(\mathbf{B})\,(N_B-1)} = O(N^{-1}) \tag{89}$$

$$\mathbb{E}\left\{\frac{|1+d^{-1}\eta_t^{\mathrm{UE}}|^k}{\|\hat{\mathbf{h}}\|_2^4}\right\} \le \frac{2^k + 2^k\big(\tfrac{k}{2}\big)!\,(\kappa_t^{\mathrm{UE}})^{k/2}}{\lambda_{\min}^{+}(\mathbf{B})^2\,(N_B-1)(N_B-2)} = O(N^{-2}) \tag{90}$$

for any even integer $k$, where $\lambda_{\min}^{+}(\mathbf{B}) > 0$ denotes the smallest non-zero eigenvalue of $\mathbf{B} = \sigma^2\mathbf{A}\mathbf{A}^H$ and $N_B = \mathrm{rank}(\mathbf{B})$.

Proof: Using the conditional distribution of the channel estimate in (76), it holds for any integer $q > 0$ that

$$\begin{aligned}
\mathbb{E}\left\{\frac{|1+d^{-1}\eta_t^{\mathrm{UE}}|^k}{\|\hat{\mathbf{h}}\|_2^{2q}}\right\} &= \mathbb{E}\left\{\mathbb{E}\left\{\frac{|1+d^{-1}\eta_t^{\mathrm{UE}}|^k}{\|\hat{\mathbf{h}}\|_2^{2q}}\,\Bigg|\,\eta_t^{\mathrm{UE}}, \mathbf{D}_r\right\}\right\} \\
&\le \mathbb{E}\left\{\frac{2^k\big(1+|d^{-1}\eta_t^{\mathrm{UE}}|^k\big)}{\lambda_{\min}^{+}(\sigma^2\mathbf{A}\mathbf{A}^H)^q}\right\}\mathbb{E}\left\{\frac{1}{\|\mathbf{U}_B^H\mathbf{v}\|_2^{2q}}\right\}
\end{aligned} \tag{91}$$

where the inequality follows from

$$\|\hat{\mathbf{h}}\|_2^2 = \mathbf{v}^H\mathbf{A}(\boldsymbol{\Phi}+\mathbf{S}+\sigma^2\mathbf{I})\mathbf{A}^H\mathbf{v} \ge \sigma^2\mathbf{v}^H\mathbf{A}\mathbf{A}^H\mathbf{v} \ge \lambda_{\min}^{+}(\sigma^2\mathbf{A}\mathbf{A}^H)\,\|\mathbf{U}_B^H\mathbf{v}\|_2^2$$

where $\mathbf{v} \sim \mathcal{CN}(\mathbf{0},\mathbf{I})$ and $\mathbf{U}_B \in \mathbb{C}^{N\times N_B}$ is an orthogonal basis of the span of $\mathbf{A}\mathbf{A}^H$ (note that this matrix is generally rank-deficient). The expectations were separated since $\|\mathbf{U}_B^H\mathbf{v}\|_2$ is independent of the smallest non-zero eigenvalue. Furthermore, the rule $|a+b|^q \le 2^q(|a|^q+|b|^q)$ from Hölder's inequality was applied on $|1+d^{-1}\eta_t^{\mathrm{UE}}|^k$. Note that $\mathbf{U}_B^H\mathbf{v} \sim \mathcal{CN}(\mathbf{0},\mathbf{I})$ with a dimension reduced from $N$ to $N_B$.

The final result in (89)–(90) follows from $\mathbb{E}\{|d^{-1}\eta_t^{\mathrm{UE}}|^k\} = \big(\tfrac{k}{2}\big)!\,(\kappa_t^{\mathrm{UE}})^{k/2}$ and that [72, Lemma 2.10] with $m=1$ and $n=N_B$ gives

$$\mathbb{E}\left\{\frac{1}{\|\mathbf{U}_B^H\mathbf{v}\|_2^{2q}}\right\} = \begin{cases} \dfrac{1}{N_B-1}, & \text{if } q=1, \\[2mm] \dfrac{1}{(N_B-1)(N_B-2)}, & \text{if } q=2, \end{cases} \tag{92}$$

and that $N_B$ scales linearly with $N$ (see Section II).

APPENDIX C
COLLECTION OF PROOFS

A. Proof of Lemma 1
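As a numerical sanity check of the generalized Rayleigh quotient step used in this proof, the sketch below evaluates the SINR in (94) for a candidate maximizer and for random unit-norm beamformers. It is an illustration under stated assumptions, not part of the original derivation: the channel realization, the values of `kappa_t`, `kappa_r`, and `c` (which stands in for $\sigma^2/p^{\mathrm{BS}}$), and all function names are hypothetical. Because the matrix inverted in (96) is a diagonal matrix plus a rank-one term, the Sherman–Morrison identity implies that the maximizer is proportional to the vector with elements $h_i^*/(\kappa_t^{\mathrm{BS}}|h_i|^2 + c)$, which is the form used in the code.

```python
import random


def sinr(h, w, kappa_t, kappa_r, c):
    """SINR of the form in (94) for a unit-norm beamformer w.

    c is a stand-in for sigma^2 / p_BS (an assumed value in this sketch).
    """
    signal = abs(sum(hi * wi for hi, wi in zip(h, w))) ** 2
    distortion = kappa_t * sum(abs(hi * wi) ** 2 for hi, wi in zip(h, w))
    return signal / (distortion + kappa_r * signal + c)


def normalize(w):
    """Scale w to unit Euclidean norm."""
    norm = sum(abs(wi) ** 2 for wi in w) ** 0.5
    return [wi / norm for wi in w]


def rayleigh_quotient_solution(h, kappa_t, c):
    """Direction (kappa_t * D + c * I)^{-1} h^*, normalized.

    The rank-one term kappa_r * h^* h^T only rescales this direction
    (Sherman-Morrison), so it does not appear here.
    """
    return normalize([hi.conjugate() / (kappa_t * abs(hi) ** 2 + c) for hi in h])


rng = random.Random(0)
N = 8                                   # number of BS antennas (hypothetical)
kappa_t, kappa_r, c = 0.0025, 0.0025, 0.1  # hypothetical impairment/noise levels

# One realization of an i.i.d. CN(0, 1) channel (an assumption of this sketch).
h = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) / 2 ** 0.5 for _ in range(N)]

w_opt = rayleigh_quotient_solution(h, kappa_t, c)
best = sinr(h, w_opt, kappa_t, kappa_r, c)

# Closed form psi / (1 + kappa_r * psi), with psi as defined in the proof of Theorem 2.
psi = sum(abs(hi) ** 2 / (kappa_t * abs(hi) ** 2 + c) for hi in h)
closed_form = psi / (1 + kappa_r * psi)

# Best SINR found among random unit-norm beamformers, for comparison.
random_best = max(
    sinr(h, normalize([complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(N)]),
         kappa_t, kappa_r, c)
    for _ in range(2000)
)

print(best, closed_form, random_best)
```

For any channel realization, `best` should agree with `closed_form` up to rounding, and no randomly drawn beamformer should exceed it.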

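The DL capacity in (20) is upper bounded as

$$C^{\mathrm{DL}} \le \frac{T^{\mathrm{DL}}_{\mathrm{data}}}{T_{\mathrm{coher}}}\,\mathbb{E}\left\{\max_{\mathbf{w}(\mathbf{h}):\,\|\mathbf{w}\|_2=1} \log_2\big(1+\mathrm{SINR}(\mathbf{w})\big)\right\} \tag{93}$$

where

$$\mathrm{SINR}(\mathbf{w}) = \frac{|\mathbf{h}^T\mathbf{w}|^2}{\kappa_t^{\mathrm{BS}}\sum_{i=1}^{N}|h_i w_i|^2 + \kappa_r^{\mathrm{UE}}|\mathbf{h}^T\mathbf{w}|^2 + \frac{\sigma^2}{p^{\mathrm{BS}}}} \tag{94}$$

by assuming that the interference part of $n$ is somehow canceled, perfect CSI is available, and exploiting the corresponding optimality of single-stream Gaussian signaling [2], [6]. We can write (94) as

$$\mathrm{SINR}(\mathbf{w}) = \frac{\mathbf{w}^H\mathbf{h}^*\mathbf{h}^T\mathbf{w}}{\mathbf{w}^H\big(\kappa_t^{\mathrm{BS}}\mathbf{D}_{|\mathbf{h}|^2} + \kappa_r^{\mathrm{UE}}\mathbf{h}^*\mathbf{h}^T + \frac{\sigma^2}{p^{\mathrm{BS}}}\mathbf{I}\big)\mathbf{w}} \tag{95}$$

by utilizing $\mathbf{w}^H\mathbf{w} = 1$. Since the logarithm is a monotonically increasing function, the maximization in (93) can be applied onto $\mathrm{SINR}(\mathbf{w})$. Using (95), this optimization is a generalized Rayleigh quotient problem and thus solved by

$$\mathbf{w} = \frac{\big(\kappa_t^{\mathrm{BS}}\mathbf{D}_{|\mathbf{h}|^2} + \kappa_r^{\mathrm{UE}}\mathbf{h}^*\mathbf{h}^T + \frac{\sigma^2}{p^{\mathrm{BS}}}\mathbf{I}\big)^{-1}\mathbf{h}^*}{\Big\|\big(\kappa_t^{\mathrm{BS}}\mathbf{D}_{|\mathbf{h}|^2} + \kappa_r^{\mathrm{UE}}\mathbf{h}^*\mathbf{h}^T + \frac{\sigma^2}{p^{\mathrm{BS}}}\mathbf{I}\big)^{-1}\mathbf{h}^*\Big\|_2} \tag{96}$$

which is equivalent to (24) by using Lemma 2. The DL capacity bound in (22) follows from plugging (96) into (95) (we also took the complex conjugate of the real-valued SINR expression to make it more consistent with the UL).

The UL capacity bound in (23) follows from [6] and by assuming that the interference part of $\boldsymbol{\nu}$ is somehow canceled. We note that the uplink SINR with a receive combining vector $\mathbf{w}$ is

$$\frac{\mathbf{w}^H\mathbf{h}\mathbf{h}^H\mathbf{w}}{\mathbf{w}^H\big(\kappa_t^{\mathrm{UE}}\mathbf{h}\mathbf{h}^H + \kappa_r^{\mathrm{BS}}\mathbf{D}_{|\mathbf{h}|^2} + \frac{\sigma^2}{p^{\mathrm{UE}}}\mathbf{I}\big)\mathbf{w}}. \tag{97}$$

The receiver combining vector in (25) maximizes (97) and achieves the upper bound in (23).

B. Proof of Theorem 2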
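The Jensen step at the heart of this proof, $\mathbb{E}\{m(\psi)\} \le m(\mathbb{E}\{\psi\})$ for the concave function $m(\psi) = \log_2\big(1 + \psi/(1+\kappa_r^{\mathrm{UE}}\psi)\big)$, can be illustrated with a small Monte Carlo sketch. The setup is an assumption made only for illustration: i.i.d. Rayleigh fading with unit-variance channel coefficients (so that $|h_i|^2$ is exponentially distributed), and hypothetical values for `kappa_t`, `kappa_r`, and `c`, where `c` stands in for $\sigma^2/p^{\mathrm{BS}}$.

```python
import math
import random


def m(psi, kappa_r):
    """Concave rate function log2(1 + psi / (1 + kappa_r * psi))."""
    return math.log2(1 + psi / (1 + kappa_r * psi))


def sample_psi(rng, n_antennas, kappa_t, c):
    """Draw psi = sum_i |h_i|^2 / (kappa_t * |h_i|^2 + c).

    Assumes h_i ~ CN(0, 1), so that |h_i|^2 ~ Exp(1) (illustrative only).
    """
    gains = (rng.expovariate(1.0) for _ in range(n_antennas))
    return sum(g / (kappa_t * g + c) for g in gains)


rng = random.Random(1)
N, kappa_t, kappa_r, c = 16, 0.01, 0.01, 0.1  # hypothetical parameters

samples = [sample_psi(rng, N, kappa_t, c) for _ in range(20000)]

mean_of_m = sum(m(p, kappa_r) for p in samples) / len(samples)  # estimate of E{m(psi)}
m_of_mean = m(sum(samples) / len(samples), kappa_r)              # m(estimate of E{psi})

print(mean_of_m, m_of_mean)
```

Because Jensen's inequality also holds for the empirical distribution of the samples, `mean_of_m` cannot exceed `m_of_mean` beyond rounding error. The gap between the two values indicates how loose the closed-form bound is under these assumed parameters.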

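The DL capacity bound in (22) can be rewritten as

$$\frac{T^{\mathrm{DL}}_{\mathrm{data}}}{T_{\mathrm{coher}}}\,\mathbb{E}\left\{\log_2\left(1 + \frac{\mathbf{h}^H\big(\kappa_t^{\mathrm{BS}}\mathbf{D}_{|\mathbf{h}|^2} + \frac{\sigma^2}{p^{\mathrm{BS}}}\mathbf{I}\big)^{-1}\mathbf{h}}{1 + \kappa_r^{\mathrm{UE}}\,\mathbf{h}^H\big(\kappa_t^{\mathrm{BS}}\mathbf{D}_{|\mathbf{h}|^2} + \frac{\sigma^2}{p^{\mathrm{BS}}}\mathbf{I}\big)^{-1}\mathbf{h}}\right)\right\} \tag{98}$$

using Lemma 2. This expression has the structure $m(\psi) = \log_2\big(1 + \frac{\psi}{1+\kappa_r^{\mathrm{UE}}\psi}\big)$ where $\psi = \mathbf{h}^H\big(\kappa_t^{\mathrm{BS}}\mathbf{D}_{|\mathbf{h}|^2} + \frac{\sigma^2}{p^{\mathrm{BS}}}\mathbf{I}\big)^{-1}\mathbf{h}$. Since $m(\psi)$ is a concave function of $\psi$, we apply Jensen's inequality to achieve a new upper bound

$$C^{\mathrm{DL}} \le \frac{T^{\mathrm{DL}}_{\mathrm{data}}}{T_{\mathrm{coher}}}\,\mathbb{E}\{m(\psi)\} \le \frac{T^{\mathrm{DL}}_{\mathrm{data}}}{T_{\mathrm{coher}}}\,m\big(\mathbb{E}\{\psi\}\big). \tag{99}$$

The upper bound in (26) follows from evaluating $\mathbb{E}\{\psi\}$ as

$$\mathbb{E}\{\psi\} = \mathbb{E}\left\{\mathbf{h}^H\Big(\kappa_t^{\mathrm{BS}}\mathbf{D}_{|\mathbf{h}|^2} + \frac{\sigma^2}{p^{\mathrm{BS}}}\mathbf{I}\Big)^{-1}\mathbf{h}\right\} = \sum_{i=1}^{N}\mathbb{E}\left\{\frac{|h_i|^2}{\kappa_t^{\mathrm{BS}}|h_i|^2 + \frac{\sigma^2}{p^{\mathrm{BS}}}}\right\} = G_{\mathrm{DL}} \tag{100}$$

where the expression for $G_{\mathrm{DL}}$ is obtained from Lemma 3 using $a = \kappa_t^{\mathrm{BS}}$ and $b = \frac{\sigma^2}{p^{\mathrm{BS}}}$.

The closed-form upper bound on the UL capacity in (27) is derived analogously to the DL capacity bound.

C. Proof of Theorem 4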
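Several steps of this proof invoke Lemma 10, whose scaling laws rest on the inverse moments $\mathbb{E}\{1/\|\mathbf{v}\|_2^2\} = 1/(n-1)$ and $\mathbb{E}\{1/\|\mathbf{v}\|_2^4\} = 1/((n-1)(n-2))$ of an $n$-dimensional vector $\mathbf{v} \sim \mathcal{CN}(\mathbf{0},\mathbf{I})$, as in (92). The following Monte Carlo sketch checks these two identities numerically. The dimension `n`, the number of trials, and the function name are hypothetical choices for illustration; the code uses the fact that $\|\mathbf{v}\|_2^2$ follows a Gamma distribution with shape $n$ and unit scale.

```python
import random


def inverse_moments(n, trials, seed=0):
    """Estimate E{1/x} and E{1/x^2} for x = ||v||^2 with v ~ CN(0, I_n).

    x is a sum of n independent Exp(1) variables, i.e. Gamma(shape=n, scale=1).
    """
    rng = random.Random(seed)
    first = second = 0.0
    for _ in range(trials):
        x = rng.gammavariate(n, 1.0)
        first += 1.0 / x
        second += 1.0 / x ** 2
    return first / trials, second / trials


n = 8  # plays the role of N_B in (92); hypothetical value
m1, m2 = inverse_moments(n, trials=100000)

# Ratios of the estimates to the closed forms 1/(n-1) and 1/((n-1)(n-2)).
print(m1 * (n - 1), m2 * (n - 1) * (n - 2))
```

Both printed ratios should be close to one. Since the closed forms decay as $1/n$ and $1/n^2$, this is the source of the $O(N^{-1})$ and $O(N^{-2})$ scalings in Lemma 10 once $N_B$ grows linearly with $N$.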

We introduce the notation $\sqrt{\mathrm{tr}(\mathbf{R}-\mathbf{C})}\,\varphi = \frac{\vartheta}{\sqrt{\gamma}}$ where

$$\vartheta = (1 + d^{-1}\eta_t^{\mathrm{UE}})\,\mathrm{tr}(\mathbf{R}-\mathbf{C}) \tag{101}$$

$$\gamma = \mathrm{tr}\big(\mathbf{A}(|d+\eta_t^{\mathrm{UE}}|^2\mathbf{R}+\boldsymbol{\Psi})\mathbf{A}^H\big). \tag{102}$$

Starting with the equivalence in (38), we use the rule $a^2-b^2 = (a+b)(a-b)$ to obtain

$$\begin{aligned}
\left|\,\left|\mathbb{E}\left\{\frac{\mathbf{h}^H\hat{\mathbf{h}}}{\|\hat{\mathbf{h}}\|_2}\right\}\right|^2 - \left|\mathbb{E}\left\{\frac{\vartheta}{\sqrt{\gamma}}\right\}\right|^2\right|
&= \left|\mathbb{E}\left\{\frac{\mathbf{h}^H\hat{\mathbf{h}}}{\|\hat{\mathbf{h}}\|_2} - \frac{\vartheta}{\sqrt{\gamma}}\right\}\right|\,\left|\mathbb{E}\left\{\frac{\mathbf{h}^H\hat{\mathbf{h}}}{\|\hat{\mathbf{h}}\|_2} + \frac{\vartheta}{\sqrt{\gamma}}\right\}\right| \\
&\le \mathbb{E}\left\{\left|\frac{\mathbf{h}^H\hat{\mathbf{h}}}{\|\hat{\mathbf{h}}\|_2} - \frac{\vartheta}{\sqrt{\gamma}}\right|\right\}\left(\left|\mathbb{E}\left\{\frac{\mathbf{h}^H\hat{\mathbf{h}}}{\|\hat{\mathbf{h}}\|_2}\right\}\right| + \left|\mathbb{E}\left\{\frac{\vartheta}{\sqrt{\gamma}}\right\}\right|\right).
\end{aligned} \tag{103}$$

In order to prove the first part of the theorem, we must show that the right-hand side of (103) behaves as $O(\sqrt{N})$.
Using the Cauchy–Schwarz inequality and that $\gamma \ge \mathrm{tr}(\mathbf{A}\boldsymbol{\Psi}\mathbf{A}^H)$, we have

$$\left|\mathbb{E}\left\{\frac{\mathbf{h}^H\hat{\mathbf{h}}}{\|\hat{\mathbf{h}}\|_2}\right\}\right| + \left|\mathbb{E}\left\{\frac{\vartheta}{\sqrt{\gamma}}\right\}\right| \le \mathbb{E}\{\|\mathbf{h}\|_2\} + \left|\mathbb{E}\left\{\frac{\vartheta}{\sqrt{\mathrm{tr}(\mathbf{A}\boldsymbol{\Psi}\mathbf{A}^H)}}\right\}\right| = O(\sqrt{N}) \tag{104}$$

where $\mathbb{E}\{\|\mathbf{h}\|_2\} = O(\sqrt{N})$ and the second term is bounded in the same way since $\mathbb{E}\{\vartheta\} = \mathrm{tr}(\mathbf{R}-\mathbf{C}) = O(N)$ and $\mathrm{tr}(\mathbf{A}\boldsymbol{\Psi}\mathbf{A}^H)$ grows at least linearly with $N$ (see the proof of Lemma 9). Hence, it remains to prove that $\mathbb{E}\big\{\big|\mathbf{h}^H\hat{\mathbf{h}}/\|\hat{\mathbf{h}}\|_2 - \vartheta/\sqrt{\gamma}\big|\big\} = O(1)$. To this end, we expand the expression using Lemma 4:

$$\mathbb{E}\left\{\left|\frac{\mathbf{h}^H\hat{\mathbf{h}}}{\|\hat{\mathbf{h}}\|_2} - \frac{\vartheta}{\sqrt{\gamma}}\right|\right\} \le \mathbb{E}\left\{\frac{|\mathbf{h}^H\hat{\mathbf{h}} - \vartheta|}{\|\hat{\mathbf{h}}\|_2}\right\} + \mathrm{tr}(\mathbf{R}-\mathbf{C})\,\mathbb{E}\left\{\frac{|1+d^{-1}\eta_t^{\mathrm{UE}}|\,\big|\|\hat{\mathbf{h}}\|_2 - \sqrt{\gamma}\big|}{\|\hat{\mathbf{h}}\|_2\sqrt{\gamma}}\right\}. \tag{105}$$

The first term in (105) is asymptotically bounded since

$$\mathbb{E}\left\{\frac{|\mathbf{h}^H\hat{\mathbf{h}} - \vartheta|}{\|\hat{\mathbf{h}}\|_2}\right\} \le \sqrt{\underbrace{\mathbb{E}\big\{|\mathbf{h}^H\hat{\mathbf{h}} - \vartheta|^2\big\}}_{(a)\,=\,O(N)}\;\underbrace{\mathbb{E}\left\{\frac{1}{\|\hat{\mathbf{h}}\|_2^2}\right\}}_{(b)\,=\,O(N^{-1})}} = O(1) \tag{106}$$

where the expectations of the numerator and denominator are separated using Hölder's inequality, $(a)$ follows from Lemma 8, and $(b)$ from Lemma 10 with $k = 0$.
The second term of (105) is upper bounded by

$$\underbrace{\mathrm{tr}(\mathbf{R}-\mathbf{C})}_{=\,O(N)}\;\sqrt{\underbrace{\mathbb{E}\left\{\frac{|1+d^{-1}\eta_t^{\mathrm{UE}}|^2}{\|\hat{\mathbf{h}}\|_2^2\,\gamma}\right\}}_{=\,O(N^{-2})}}\;\sqrt{\underbrace{\mathbb{E}\Big\{\big|\|\hat{\mathbf{h}}\|_2 - \sqrt{\gamma}\big|^2\Big\}}_{=\,O(1)}} = O(1) \tag{107}$$

where Hölder's inequality was used to separate the expectations. The scaling of the first square root follows from Lemma 10 and that $\frac{1}{\gamma} \le \frac{1}{\mathrm{tr}(\mathbf{A}\boldsymbol{\Psi}\mathbf{A}^H)} = O(N^{-1})$ for any realization of $\eta_t^{\mathrm{UE}}$. The scaling of the second square root follows from Lemma 9. By plugging these scaling expressions into (103), we have proved (38).

Similarly, the equivalence in (39) follows if

$$\begin{aligned}
&\mathbb{E}\left\{\left|\frac{|\mathbf{h}^H\hat{\mathbf{h}}|^2}{\|\hat{\mathbf{h}}\|_2^2} - \frac{|1+d^{-1}\eta_t^{\mathrm{UE}}|^2\,(\mathrm{tr}(\mathbf{R}-\mathbf{C}))^2}{\gamma}\right|\right\} \\
&\quad\le \big(\mathrm{tr}(\mathbf{R}-\mathbf{C})\big)^2\,\mathbb{E}\left\{\frac{|1+d^{-1}\eta_t^{\mathrm{UE}}|^2\,\big|\|\hat{\mathbf{h}}\|_2^2 - \gamma\big|}{\|\hat{\mathbf{h}}\|_2^2\,\gamma}\right\} + \mathbb{E}\left\{\frac{\big||\mathbf{h}^H\hat{\mathbf{h}}|^2 - |1+d^{-1}\eta_t^{\mathrm{UE}}|^2\,(\mathrm{tr}(\mathbf{R}-\mathbf{C}))^2\big|}{\|\hat{\mathbf{h}}\|_2^2}\right\}
\end{aligned} \tag{108}$$

scales as $O(\sqrt{N})$, where the inequality follows from Lemma 4.
By applying Hölder's inequality on the first term, we obtain

$$\underbrace{\frac{(\mathrm{tr}(\mathbf{R}-\mathbf{C}))^2}{\mathrm{tr}(\mathbf{A}\boldsymbol{\Psi}\mathbf{A}^H)}}_{(a)\,=\,O(N)}\;\sqrt{\underbrace{\mathbb{E}\left\{\frac{|1+d^{-1}\eta_t^{\mathrm{UE}}|^4}{\|\hat{\mathbf{h}}\|_2^4}\right\}}_{(b)\,=\,O(N^{-2})}}\;\sqrt{\underbrace{\mathbb{E}\Big\{\big|\|\hat{\mathbf{h}}\|_2^2 - \gamma\big|^2\Big\}}_{(c)\,=\,O(N)}} = O\big(\sqrt{N}\big) \tag{109}$$

where $(a)$ follows from $\frac{1}{\gamma} \le \frac{1}{\mathrm{tr}(\mathbf{A}\boldsymbol{\Psi}\mathbf{A}^H)} = O(N^{-1})$, which is a deterministic upper bound, $(b)$ is characterized by Lemma 10, and $(c)$ follows from Lemma 9. The second term behaves as

$$\sqrt{\underbrace{\mathbb{E}\Big\{\big||\mathbf{h}^H\hat{\mathbf{h}}| - |1+d^{-1}\eta_t^{\mathrm{UE}}|\,\mathrm{tr}(\mathbf{R}-\mathbf{C})\big|^2\Big\}}_{(d)\,=\,O(N)}} \times \sqrt{\underbrace{\mathbb{E}\left\{\frac{|\mathbf{h}^H\hat{\mathbf{h}}|^2}{\|\hat{\mathbf{h}}\|_2^4}\right\}}_{(e)\,=\,O(1)} + \underbrace{\mathbb{E}\left\{\frac{|1+d^{-1}\eta_t^{\mathrm{UE}}|^2\,(\mathrm{tr}(\mathbf{R}-\mathbf{C}))^2}{\|\hat{\mathbf{h}}\|_2^4}\right\}}_{(f)\,=\,O(1)}} = O(\sqrt{N}) \tag{110}$$

by using the rule $a^2-b^2 = (a+b)(a-b)$ and Hölder's inequality. $(d)$ is characterized by Lemma 8 and $(f)$ by Lemma 10. Moreover, $(e)$ follows since $\frac{|\mathbf{h}^H\hat{\mathbf{h}}|^2}{\|\hat{\mathbf{h}}\|_2^4} \le \frac{\|\mathbf{h}\|_2^2}{\|\hat{\mathbf{h}}\|_2^2} = O(1)$ by the Cauchy–Schwarz inequality and because $\|\mathbf{h}\|_2^2$ and $\|\hat{\mathbf{h}}\|_2^2$ have the same asymptotic scaling.
By plugging these scaling expressions into (108), we have proved (39).

Finally, the equivalence in (40) follows since

$$\begin{aligned}
\left|\sum_{i=1}^{N}\mathbb{E}\{|h_i|^2|v_i|^2\} - 0\right|
&\overset{(a)}{=} \left|\mathbb{E}\left\{\frac{\|\hat{\mathbf{h}}\|_4^2}{\|\hat{\mathbf{h}}\|_2^2}\sum_{i=1}^{N}|h_i|^2\frac{|\hat{h}_i|^2}{\|\hat{\mathbf{h}}\|_4^2}\right\}\right| \\
&\overset{(b)}{\le} \left|\mathbb{E}\left\{\frac{\|\hat{\mathbf{h}}\|_4^2}{\|\hat{\mathbf{h}}\|_2^2}\sqrt{\sum_{i=1}^{N}|h_i|^4}\,\sqrt{\sum_{i=1}^{N}\frac{|\hat{h}_i|^4}{\|\hat{\mathbf{h}}\|_4^4}}\right\}\right| = \left|\mathbb{E}\left\{\frac{\|\hat{\mathbf{h}}\|_4^2}{\|\hat{\mathbf{h}}\|_2^2}\|\mathbf{h}\|_4^2\right\}\right| \\
&\overset{(c)}{\le} \sqrt{\sqrt{\mathbb{E}\{\|\mathbf{h}\|_4^8\}\,\mathbb{E}\{\|\hat{\mathbf{h}}\|_4^8\}}\;\mathbb{E}\left\{\frac{1}{\|\hat{\mathbf{h}}\|_2^4}\right\}} \overset{(d)}{=} O(1)
\end{aligned} \tag{111}$$

where $(a)$ follows from $|v_i|^2 = \frac{|\hat{h}_i|^2}{\|\hat{\mathbf{h}}\|_2^2}$ and by inserting the $L_4$-norms $\|\hat{\mathbf{h}}\|_4$. The reason for this is that the vector $[|\hat{h}_1|^2 \,\ldots\, |\hat{h}_N|^2]^T / \|\hat{\mathbf{h}}\|_4^2$ now has unit $L_2$-norm, thus we apply the Cauchy–Schwarz inequality in $(b)$ to bound the sum by $\|\mathbf{h}\|_4^2$. Next, $(c)$ is obtained by applying Hölder's inequality twice and $(d)$ follows from $\mathbb{E}\{\|\mathbf{h}\|_4^8\} = O(N^2)$, $\mathbb{E}\{\|\hat{\mathbf{h}}\|_4^8\} = O(N^2)$, and $\mathbb{E}\big\{\frac{1}{\|\hat{\mathbf{h}}\|_2^4}\big\} = O(N^{-2})$ from Lemma 10.

D. Proof of Theorem 5

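The bounds in this theorem are derived using the capacity lower bounds in Corollaries 4 and 5. We begin with the DL and note that the arguments of the expectations in (42) have deterministic upper bounds since

$$|\varphi| \le \sqrt{\frac{\mathrm{tr}(\mathbf{R}-\mathbf{C})}{p^{\mathrm{UE}}\,\mathrm{tr}(\mathbf{A}\mathbf{R}\mathbf{A}^H)}} \tag{112}$$

for any realization of $\eta_t^{\mathrm{UE}}$. The dominated convergence theorem implies that we can take the limit $N \to \infty$ inside the expectations. Next, we observe that scaling the pilot power $p^{\mathrm{UE}}$ proportionally to $1/N^{t_{\mathrm{UE}}}$ for some $t_{\mathrm{UE}} > 0$ means that $N^{t_{\mathrm{UE}}}p^{\mathrm{UE}} \to B$ as $N \to \infty$ for some $0 < B < \infty$. (Footnote: To be strict, we first should multiply all terms in (42) by $p^{\mathrm{UE}}$.) Therefore, we have

$$N^{t_{\mathrm{UE}}}\,\mathrm{tr}(\mathbf{R}-\mathbf{C}) = N^{t_{\mathrm{UE}}}p^{\mathrm{UE}}\,\mathrm{tr}\big(\mathbf{R}\bar{\mathbf{Z}}^{-1}\mathbf{R}\big) \to B\,\mathrm{tr}\big(\mathbf{R}(\mathbf{S}+\sigma^2\mathbf{I})^{-1}\mathbf{R}\big), \tag{113}$$

$$N^{t_{\mathrm{UE}}}\,\mathrm{tr}\big(\mathbf{A}(|d+\eta_t^{\mathrm{UE}}|^2\mathbf{R}+\boldsymbol{\Psi})\mathbf{A}^H\big) \to B\,\mathrm{tr}\big(\mathbf{R}(\mathbf{S}+\sigma^2\mathbf{I})^{-1}\mathbf{R}\big), \tag{114}$$

as $N \to \infty$. Using (113) and (114) and the dominated convergence theorem we obtain

$$\lim_{N\to\infty}|\mathbb{E}\{\varphi\}| = \left|\mathbb{E}\left\{\frac{(1+d^{-1}\eta_t^{\mathrm{UE}})\sqrt{B\,\mathrm{tr}\big(\mathbf{R}(\mathbf{S}+\sigma^2\mathbf{I})^{-1}\mathbf{R}\big)}}{\sqrt{B\,\mathrm{tr}\big(\mathbf{R}(\mathbf{S}+\sigma^2\mathbf{I})^{-1}\mathbf{R}\big)}}\right\}\right| = 1 \tag{115}$$

$$\lim_{N\to\infty}\mathbb{E}\{|\varphi|^2\} = \mathbb{E}\left\{\frac{|1+d^{-1}\eta_t^{\mathrm{UE}}|^2\,B\,\mathrm{tr}\big(\mathbf{R}(\mathbf{S}+\sigma^2\mathbf{I})^{-1}\mathbf{R}\big)}{B\,\mathrm{tr}\big(\mathbf{R}(\mathbf{S}+\sigma^2\mathbf{I})^{-1}\mathbf{R}\big)}\right\} = 1 + \kappa_t^{\mathrm{UE}} \tag{116}$$

which holds for any $t_{\mathrm{UE}} > 0$. The goal is to make the interference term

$$\frac{\mathbb{E}\{I^{\mathrm{UE}}_{\mathcal{H}}\}}{p^{\mathrm{BS}}\,\mathrm{tr}(\mathbf{R}-\mathbf{C})} = \frac{\mathbb{E}\{I^{\mathrm{UE}}_{\mathcal{H}}\}}{p^{\mathrm{UE}}p^{\mathrm{BS}}\,\mathrm{tr}(\mathbf{R}\bar{\mathbf{Z}}^{-1}\mathbf{R})}$$

vanish asymptotically under the assumption that $\mathbb{E}\{I^{\mathrm{UE}}_{\mathcal{H}}\} = O(1)$, which is achieved if the denominator grows to infinity with $N$. We note that $\mathrm{tr}(\mathbf{R}\bar{\mathbf{Z}}^{-1}\mathbf{R}) \ge \frac{\|\mathbf{R}\|_F^2}{\|\bar{\mathbf{Z}}\|_2}$ scales at least linearly with $N$. Hence, the product $p^{\mathrm{UE}}p^{\mathrm{BS}}$ must reduce at a slower pace than linear with $N$, which implies $t_{\mathrm{BS}} + t_{\mathrm{UE}} = t_{\mathrm{sum}} < 1$.

Finally, we need the $O(1/\sqrt{N})$ terms in (51) to still vanish as $N \to \infty$.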
Some careful but lengthy algebra reveals that the $O(N)$ properties in Lemmas 9–10 become $O(N^{1-t_{\mathrm{UE}}})$. The term $O(1/\sqrt{N})$ in the numerator of (51) becomes $O\big(1/N^{\frac{1-t_{\mathrm{UE}}}{2}}\big)$ while the $O(1/\sqrt{N})$ in the denominator becomes $O\big(1/N^{\frac{1}{2}-t_{\mathrm{UE}}}\big)$. These terms vanish if $t_{\mathrm{UE}} < \frac{1}{2}$, which finishes the proof for the DL.

The proof for the UL is analogous since the uplink capacity bound in (44) has the same structure and contains the same expectations as the downlink capacity.

E. Proof of Corollary 7

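Recall from the proof of Theorem 5 that the dominated convergence theorem can be applied, which means that we can take the limit $N \to \infty$ inside the expectations in the DL capacity bound of (42) and UL capacity bound of (44). If $\kappa_t^{\mathrm{BS}}$ and $\kappa_r^{\mathrm{BS}}$ grow with $N$, we obtain

$$\lim_{N\to\infty}|\mathbb{E}\{\varphi\}| = \left|\mathbb{E}\left\{\frac{(1+d^{-1}\eta_t^{\mathrm{UE}})\sqrt{\mathrm{tr}(\mathbf{R}\mathbf{R}_{\mathrm{diag}}^{-1}\mathbf{R})}}{\sqrt{\mathrm{tr}(\mathbf{R}\mathbf{R}_{\mathrm{diag}}^{-1}\mathbf{R})}}\right\}\right| = 1 \tag{117}$$

$$\lim_{N\to\infty}\mathbb{E}\{|\varphi|^2\} = \mathbb{E}\left\{\frac{|1+d^{-1}\eta_t^{\mathrm{UE}}|^2\,\mathrm{tr}(\mathbf{R}\mathbf{R}_{\mathrm{diag}}^{-1}\mathbf{R})}{\mathrm{tr}(\mathbf{R}\mathbf{R}_{\mathrm{diag}}^{-1}\mathbf{R})}\right\} = 1 + \kappa_t^{\mathrm{UE}}. \tag{118}$$

In the DL, we further note that

$$\frac{\kappa_t^{\mathrm{BS}}\sum_{i=1}^{N}\mathbb{E}\{|h_i|^2|v_i|^2\}}{\mathrm{tr}(\mathbf{R}-\mathbf{C})} = O\left(\frac{\kappa_t^{\mathrm{BS}}\kappa_r^{\mathrm{BS}}}{N}\right) \tag{119}$$

since $\kappa_r^{\mathrm{BS}}\,\mathrm{tr}(\mathbf{R}-\mathbf{C}) \to \mathrm{tr}(\mathbf{R}\mathbf{R}_{\mathrm{diag}}^{-1}\mathbf{R}) = O(N)$ as $N \to \infty$. If this term should vanish asymptotically, it is sufficient that $\frac{\kappa_t^{\mathrm{BS}}\kappa_r^{\mathrm{BS}}}{N} \to 0$, which corresponds to the condition in the corollary. The corresponding condition for the UL is obtained analogously and gives $\frac{(\kappa_r^{\mathrm{BS}})^2}{N} \to 0$.

Finally, we note that the noise terms (for n ≤ ) and the $O(1/\sqrt{N})$ terms in (42) and (44) all behave as $O\big(\frac{\kappa_r^{\mathrm{BS}}}{\sqrt{N}}\big)$ or smaller, after some straightforward but lengthy algebra. These terms thus vanish under the condition $\tau_r < \frac{1}{2}$ stated in the corollary.

F. Proof of Theorem 6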

The interference expressions in (61) are proved similarly to Theorem 4. For the case $l \in \mathcal{U}_{\parallel}$ we have

$$\mathbb{E}\left\{\left|\frac{|\mathbf{h}_l^H\hat{\mathbf{h}}|^2}{\|\hat{\mathbf{h}}\|_2^2} - \frac{p^{\mathrm{UE}}a_l^2}{\gamma}\right|\right\} \le \mathbb{E}\left\{\frac{\big||\mathbf{h}_l^H\hat{\mathbf{h}}|^2 - p^{\mathrm{UE}}a_l^2\big|}{\|\hat{\mathbf{h}}\|_2^2}\right\} + \mathbb{E}\left\{\frac{p^{\mathrm{UE}}a_l^2\,\big|\|\hat{\mathbf{h}}\|_2^2 - \gamma\big|}{\|\hat{\mathbf{h}}\|_2^2\,\gamma}\right\} = O(\sqrt{N}) \tag{120}$$

where $\gamma = \mathrm{tr}\big(\mathbf{A}(|d+\eta_t^{\mathrm{UE}}|^2\mathbf{R}+\boldsymbol{\Psi})\mathbf{A}^H\big)$ and $a_l = \mathrm{tr}(\mathbf{A}\mathbf{R}_l)$. This follows since the first term in (120) equals

$$\begin{aligned}
&\mathbb{E}\left\{\frac{\big||\mathbf{h}_l^H\hat{\mathbf{h}}| - \sqrt{p^{\mathrm{UE}}}\,a_l\big|\;\big||\mathbf{h}_l^H\hat{\mathbf{h}}| + \sqrt{p^{\mathrm{UE}}}\,a_l\big|}{\|\hat{\mathbf{h}}\|_2^2}\right\} \\
&\quad\le \underbrace{\sqrt{\mathbb{E}\Big\{\big||\mathbf{h}_l^H\hat{\mathbf{h}}| - \sqrt{p^{\mathrm{UE}}}\,a_l\big|^2\Big\}}}_{=\,O(\sqrt{N})}\;\underbrace{\sqrt{\mathbb{E}\left\{\frac{\|\mathbf{h}_l\|_2^2}{\|\hat{\mathbf{h}}\|_2^2} + \frac{p^{\mathrm{UE}}a_l^2}{\|\hat{\mathbf{h}}\|_2^4}\right\}}}_{=\,O(1)} = O(\sqrt{N})
\end{aligned} \tag{121}$$

by using Hölder's inequality, Lemma 6, the Cauchy–Schwarz inequality, and Lemma 10.
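The second term in (120) is also upper bounded by $O(\sqrt{N})$ by using Hölder's inequality and that $a_l = O(N)$, $\mathbb{E}\big\{\big|\|\hat{\mathbf{h}}\|_2^2 - \gamma\big|^2\big\} = O(N)$ from Lemma 9, $\mathbb{E}\{\|\hat{\mathbf{h}}\|_2^{-4}\} = O(N^{-2})$ from Lemma 10, and $\frac{1}{\gamma} \le \frac{1}{\mathrm{tr}(\mathbf{A}\boldsymbol{\Psi}\mathbf{A}^H)} = O(N^{-1})$.

Next, the case $l \in \mathcal{U}_{\perp}$ in (61) follows from $\mathbb{E}\{|\mathbf{h}_l^H\mathbf{v}^{\mathrm{UL}}|^2\} = \mathbb{E}\{(\mathbf{v}^{\mathrm{UL}})^H\mathbf{R}_l\mathbf{v}^{\mathrm{UL}}\} \le \|\mathbf{R}_l\|_2 = O(1)$ since $\mathbf{h}_l$ and $\mathbf{v}^{\mathrm{UL}}$ are independent.

Finally, we note that the noise term in the denominator of (44) would be

$$\frac{\mathbb{E}\{\mathbf{v}^H\mathbf{Q}_{\mathcal{H}}\mathbf{v}\}}{p^{\mathrm{UE}}\,\mathrm{tr}(\mathbf{R}-\mathbf{C})} = \sum_{l\in\mathcal{U}_{\parallel}}\frac{p^{\mathrm{UE}}\,\mathbb{E}\{|\mathbf{h}_l^H\mathbf{v}|^2\}}{p^{\mathrm{UE}}\,\mathrm{tr}(\mathbf{R}-\mathbf{C})} + O\left(\sqrt{\frac{1}{N}}\right) \tag{122}$$

where the first term is equal to (62) by exploiting (61) and $\mathrm{tr}(\mathbf{R}-\mathbf{C}) = \sqrt{p^{\mathrm{UE}}}\,\mathrm{tr}(\mathbf{A}\mathbf{R})$.

ACKNOWLEDGMENT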

The authors would like to thank Romain Couillet and the anonymous reviewers for indispensable feedback on this paper.

REFERENCES

[1] E. Björnson, J. Hoydis, M. Kountouris, and M. Debbah, “Hardware impairments in large-scale MISO systems: Energy efficiency, estimation, and capacity limits,” in Proc. Int. Conf. Digital Signal Process. (DSP), 2013.
[2] E. Telatar, “Capacity of multi-antenna Gaussian channels,” European Trans. Telecom., vol. 10, no. 6, pp. 585–595, 1999.
[3] X. Gao, O. Edfors, F. Rusek, and F. Tufvesson, “Linear pre-coding performance in measured very-large MIMO channels,” in Proc. IEEE VTC Fall, 2011.
[4] J. Hoydis, C. Hoek, T. Wild, and S. ten Brink, “Channel measurements for large antenna arrays,” in Int. Symp. Wireless Commun. Systems (ISWCS), 2012.
[5] M. Medard, “The effect upon channel capacity in wireless communications of perfect and imperfect knowledge of the channel,” IEEE Trans. Inf. Theory, vol. 46, no. 3, pp. 933–946, 2000.
[6] E. Björnson, P. Zetterberg, M. Bengtsson, and B. Ottersten, “Capacity limits and multiplexing gains of MIMO channels with transceiver impairments,” IEEE Commun. Lett., vol. 17, no. 1, pp. 91–94, 2013.
[7] W. Zhang, “A general framework for transmission with transceiver distortion and some applications,” IEEE Trans. Commun., vol. 60, no. 2, pp. 384–399, 2012.
[8] A. Müller, A. Kammoun, E. Björnson, and M. Debbah, “Linear precoding based on polynomial expansion: Reducing complexity in massive MIMO,” IEEE Trans. Inf. Theory, submitted. [Online]. Available: http://arxiv.org/abs/1310.1806
[9] F. Rusek, D. Persson, B. Lau, E. Larsson, T. Marzetta, O. Edfors, and F. Tufvesson, “Scaling up MIMO: Opportunities and challenges with very large arrays,” IEEE Signal Process. Mag., vol. 30, no. 1, pp. 40–60, 2013.
[10] T. Marzetta, “Noncooperative cellular wireless with unlimited numbers of base station antennas,” IEEE Trans. Wireless Commun., vol. 9, no. 11, pp. 3590–3600, 2010.
[11] J. Jose, A. Ashikhmin, T. Marzetta, and S. Vishwanath, “Pilot contamination and precoding in multi-cell TDD systems,” IEEE Trans. Commun., vol. 10, no. 8, pp. 2640–2651, 2011.
[12] J. Hoydis, S. ten Brink, and M. Debbah, “Massive MIMO in the UL/DL of cellular networks: How many antennas do we need?” IEEE J. Sel. Areas Commun., vol. 31, no. 2, pp. 160–171, 2013.
[13] H. Ngo, E. Larsson, and T. Marzetta, “Energy and spectral efficiency of very large multiuser MIMO systems,” IEEE Trans. Commun., vol. 61, no. 4, pp. 1436–1449, 2013.
[14] T. Schenk, RF Imperfections in High-Rate Wireless Systems: Impact and Digital Compensation. Springer, 2008.
[15] C. Studer, M. Wenk, and A. Burg, “MIMO transmission with residual transmit-RF impairments,” in Proc. ITG/IEEE Workshop on Smart Antennas (WSA), 2010.
[16] P. Zetterberg, “Experimental investigation of TDD reciprocity-based zero-forcing transmit precoding,” EURASIP J. on Adv. in Signal Process., Jan. 2011.
[17] M. Wenk, MIMO-OFDM Testbed: Challenges, Implementations, and Measurement Results, ser. Series in microelectronics. Hartung-Gorre, 2010.
[18] B. Göransson, S. Grant, E. Larsson, and Z. Feng, “Effect of transmitter and receiver impairments on the performance of MIMO in HSDPA,” in Proc. IEEE Int. W. Signal Process. Adv. Wireless Commun. (SPAWC), 2008.
[19] E. Björnson, P. Zetterberg, and M. Bengtsson, “Optimal coordinated beamforming in the multicell downlink with transceiver impairments,” in Proc. IEEE Global Commun. Conf. (GLOBECOM), 2012.
[20] E. Björnson and E. Jorswieck, “Optimal resource allocation in coordinated multi-cell systems,” Foundations and Trends in Communications and Information Theory, vol. 9, no. 2-3, pp. 113–381, 2013.
[21] A. Mezghani, N. Damak, and J. A. Nossek, “Circuit aware design of power-efficient short range communication systems,” in Proc. Int. Symp. Wireless Commun. Systems (ISWCS), 2010, pp. 869–873.
[22] J. Qi and S. Aissa, “On the power amplifier nonlinearity in MIMO transmit beamforming systems,” IEEE Trans. Commun., vol. 60, no. 3, pp. 876–887, 2012.
[23] B. Maham and O. Tirkkonen, “Transmit antenna selection OFDM systems with transceiver I/Q imbalance,” IEEE Trans. Veh. Technol., vol. 61, no. 2, pp. 865–871, 2012.
[24] T. Koch, A. Lapidoth, and P. Sotiriadis, “Channels that heat up,” IEEE Trans. Inf. Theory, vol. 55, no. 8, pp. 3594–3612, 2009.
[25] A. Pitarokoilis, S. Mohammed, and E. Larsson, “Uplink performance of time-reversal MRC in massive MIMO systems subject to phase noise,” IEEE Trans. Wireless Commun., submitted. [Online]. Available: http://arxiv.org/abs/1306.4495
[26] C. Studer and E. Larsson, “PAR-aware large-scale multi-user MIMO-OFDM downlink,” IEEE J. Sel. Areas Commun., vol. 31, no. 2, pp. 303–313, 2013.
[27] S. Mohammed and E. Larsson, “Per-antenna constant envelope precoding for large multi-user MIMO systems,” IEEE Trans. Commun., vol. 61, no. 3, pp. 1059–1071, 2013.
[28] G. Caire, N. Jindal, M. Kobayashi, and N. Ravindran, “Multiuser MIMO achievable rates with downlink training and channel state feedback,” IEEE Trans. Inf. Theory, vol. 56, no. 6, pp. 2845–2866, 2010.
[29] E. Björnson, M. Kountouris, M. Bengtsson, and B. Ottersten, “Receive combining vs. multi-stream multiplexing in downlink systems with multi-antenna users,” IEEE Trans. Signal Process., vol. 61, no. 13, pp. 3431–3446, 2013.
[30] S. Cui, A. Goldsmith, and A. Bahai, “Energy-constrained modulation optimization,” IEEE Trans. Wireless Commun., vol. 4, no. 5, pp. 2349–2360, 2005.
[31] H. Holma and A. Toskala, LTE for UMTS: Evolution to LTE-Advanced, 2nd ed. Wiley, 2011.
[32] S. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, 1993.
[33] J. Kotecha and A. Sayeed, “Transmit signal design for optimal estimation of correlated MIMO channels,” IEEE Trans. Signal Process., vol. 52, no. 2, pp. 546–557, 2004.
[34] E. Björnson and B. Ottersten, “A framework for training-based estimation in arbitrarily correlated Rician MIMO channels with Rician disturbance,” IEEE Trans. Signal Process., vol. 58, no. 3, pp. 1807–1820, 2010.
[35] B. Hassibi and B. Hochwald, “How much training is needed in multiple-antenna wireless links?” IEEE Trans. Inf. Theory, vol. 49, no. 4, pp. 951–963, 2003.
[36] N. O’Donoughue and J. Moura, “On the product of independent complex Gaussians,” IEEE Trans. Signal Process., vol. 60, no. 3, pp. 1050–1063, 2012.
[37] J. Hoydis, M. Kobayashi, and M. Debbah, “Optimal channel training in uplink network MIMO systems,” IEEE Trans. Signal Process., vol. 59, no. 6, pp. 2824–2833, 2011.
[38] S. Wagner, R. Couillet, M. Debbah, and D. Slock, “Large system analysis of linear precoding in MISO broadcast channels with limited feedback,” IEEE Trans. Inf. Theory, vol. 58, no. 7, pp. 4509–4537, 2012.
[39] S. Loyka, “Channel capacity of MIMO architecture using the exponential correlation matrix,” IEEE Commun. Lett., vol. 5, no. 9, pp. 369–371, 2001.
[40] E. Björnson, D. Hammarwall, and B. Ottersten, “Exploiting quantized channel norm feedback through conditional statistics in arbitrarily correlated MIMO systems,” IEEE Trans. Signal Process., vol. 57, no. 10, pp. 4027–4041, 2009.
[41] T. Yoo and A. Goldsmith, “Capacity and power allocation for fading MIMO channels with channel estimation error,” IEEE Trans. Inf. Theory, vol. 52, no. 5, pp. 2203–2214, 2006.
[42] D. Shiu, G. Foschini, M. Gans, and J. Kahn, “Fading correlation and its effect on the capacity of multielement antenna systems,” IEEE Trans. Commun., vol. 48, no. 3, pp. 502–513, 2000.
[43] H. Yin, D. Gesbert, M. Filippou, and Y. Liu, “A coordinated approach to channel estimation in large-scale multiple-antenna systems,” IEEE J. Sel. Areas Commun., vol. 31, no. 2, pp. 264–273, 2013.
[44] A. Adhikary, J. Nam, J.-Y. Ahn, and G. Caire, “Joint spatial division and multiplexing—the large-scale array regime,” IEEE Trans. Inf. Theory, vol. 59, no. 10, pp. 6441–6463, 2013.
[45] R. Couillet and M. Debbah, Random Matrix Methods for Wireless Communications. Cambridge University Press, 2011.
[46] R. Couillet, F. Pascal, and J. W. Silverstein, “Robust estimates of covariance matrices in the large dimensional regime and application to array processing,” IEEE Trans. Inf. Theory, to appear. [Online]. Available: http://arxiv.org/abs/1204.5320
[47] N. Shariati, E. Björnson, M. Bengtsson, and M. Debbah, “Low-complexity polynomial channel estimation in large-scale MIMO with arbitrary statistics,” IEEE J. Sel. Topics Signal Process., 2014, to appear.
[48] H. Huh, G. Caire, H. Papadopoulos, and S. Ramprashad, “Achieving “massive MIMO” spectral efficiency with a not-so-large number of antennas,” IEEE Trans. Wireless Commun., vol. 11, no. 9, pp. 3226–3239, 2012.
[49] G. Caire and S. Shamai, “On the capacity of some channels with channel state information,” IEEE Trans. Inf. Theory, vol. 45, no. 6, pp. 2007–2019, 1999.
[50] G. Auer, V. Giannini, C. Desset, I. Godor, P. Skillermark, M. Olsson, M. Imran, D. Sabella, M. Gonzalez, O. Blume, and A. Fehske, “How much energy is needed to run a wireless network?” IEEE Wireless Commun. Mag., vol. 18, no. 5, pp. 40–49, 2012.
[51] J. Xu, L. Qiu, and C. Yu, “Improving energy efficiency through multimode transmission in the downlink MIMO systems,” EURASIP J. Wirel. Commun. Netw., 2011.
[52] D. Ng, E. Lo, and R. Schober, “Energy-efficient resource allocation in OFDMA systems with large numbers of base station antennas,” IEEE Trans. Wireless Commun., vol. 11, no. 9, pp. 3292–3304, 2012.
[53] H. Yang and T. Marzetta, “Total energy efficiency of cellular large scale antenna system multiple access mobile networks,” in Proc. IEEE Online Conference on Green Communications (OnlineGreenComm), 2013.
[54] E. Björnson, M. Kountouris, and M. Debbah, “Massive MIMO and small cells: Improving energy efficiency by optimal soft-cell coordination,” in Proc. Int. Conf. Telecommun. (ICT), 2013.
[55] S. Verdú, “On channel capacity per unit cost,” IEEE Trans. Inf. Theory, vol. 36, no. 5, pp. 1019–1030, 1990.
[56] G. Auer, O. Blume, V. Giannini, I. Godor, M. Imran, Y. Jading, E. Katranaras, M. Olsson, D. Sabella, P. Skillermark, and W. Wajda, D2.3: Energy efficiency analysis of the reference systems, areas of improvements and target breakdown. INFSO-ICT-247733 EARTH, ver. 2.0, 2012.
[57] Further advancements for E-UTRA physical layer aspects (Release 9). 3GPP TS 36.814, Mar. 2010.
[58] C. Studer, M. Wenk, and A. Burg, “System-level implications of residual transmit-RF impairments in MIMO systems,” in Proc. European Conf. Antennas and Propagation (EuCAP), 2011.
[59] N. Moghadam, P. Zetterberg, P. Händel, and H. Hjalmarsson, “Correlation of distortion noise between the branches of MIMO transmit antennas,” in Proc. IEEE Int. Symp. Personal, Indoor and Mobile Radio Commun. (PIMRC), 2012.
[60] A. Mezghani, M.-S. Khoufi, and J. Nossek, “A modified MMSE receiver for quantized MIMO systems,” in Proc. ITG/IEEE Workshop on Smart Antennas (WSA), 2007.
[61] A. Demir, A. Mehrotra, and J. Roychowdhury, “Phase noise in oscillators: A unifying theory and numerical methods for characterization,” IEEE Trans. Circuits Syst. I, vol. 47, no. 5, pp. 655–674, 2000.
[62] D. Petrovic, W. Rave, and G. Fettweis, “Effects of phase noise on OFDM systems with and without PLL: Characterization and compensation,” IEEE Trans. Commun., vol. 55, no. 8, pp. 1607–1616, 2007.
[63] G. Durisi, A. Tarable, C. Camarda, and G. Montorsi, “On the capacity of MIMO Wiener phase-noise channels,” in Proc. Information Theory and Applications Workshop (ITA), 2013.
[64] H. Mehrpouyan, A. Nasir, S. Blostein, T. Eriksson, G. Karagiannidis, and T. Svensson, “Joint estimation of channel and oscillator phase noise in MIMO systems,” IEEE Trans. Signal Process., vol. 60, no. 9, pp. 4790–4807, 2012.
[65] Y. Nasser, M. Noes, L. Ros, and G. Jourdain, “On the system level prediction of joint time frequency spreading systems with carrier phase noise,” IEEE Trans. Commun., vol. 58, no. 3, pp. 839–850, 2010.
[66] R. Baldemair, E. Dahlman, G. Fodor, G. Mildh, S. Parkvall, Y. Selen, H. Tullberg, and K. Balachandran, “Evolving wireless communications: Addressing the challenges and expectations of the future,” IEEE Veh. Technol. Mag., vol. 8, no. 1, pp. 24–30, 2013.
[67] C. Shepard, H. Yu, N. Anand, L. Li, T. Marzetta, R. Yang, and L. Zhong, “Argos: Practical many-antenna base stations,” in Proc. ACM MobiCom, 2012.
[68] K. Nishimori, K. Cho, Y. Takatori, and T. Hori, “Automatic calibration method using transmitting signals of an adaptive array for TDD systems,” IEEE Trans. Veh. Technol., vol. 50, no. 6, pp. 1636–1640, 2001.
[69] R. Rogalin, O. Bursalioglu, H. Papadopoulos, G. Caire, and A. Molisch, “Downlink beamforming avoiding DOA estimation for cellular mobile communications,” in Proc. Information Theory and Applications Workshop (ITA), 2013.
[70] J. Silverstein and Z. Bai, “On the empirical distribution of eigenvalues of a class of large dimensional random matrices,” J. Multivariate Analysis, vol. 54, no. 2, pp. 175–192, 1995.
[71] Z. Bai and J. Silverstein, Spectral analysis of large dimensional random matrices, 2nd ed. Springer Series in Statistics, 2009.
[72] A. Tulino and S. Verdú, “Random Matrix Theory and Wireless Communications,”