Fully Decentralized Massive MIMO Detection Based on Recursive Methods

Abstract

Algorithms for Massive MIMO uplink detection typically rely on a centralized approach, in which baseband data from all antenna modules are routed to a central node for processing. In Massive MIMO, where hundreds or thousands of antennas are expected at the base station, this architecture creates a bottleneck, with critical limitations in terms of interconnection bandwidth requirements. This paper presents a fully decentralized architecture and algorithms for Massive MIMO uplink detection based on recursive methods, which do not require a central node for the detection process. Through a recursive approach and very low complexity operations, the proposed algorithms provide a sequence of estimates that converge asymptotically to the zero-forcing solution, without the need for dedicated matrix-inversion hardware. The proposed solution achieves a significantly lower interconnection data-rate than other architectures, enabling future scalability.


Jesús Rodríguez Sánchez, Fredrik Rusek, Muris Sarajlić, Ove Edfors and Liang Liu

Department of Electrical and Information Technology, Lund University, Sweden
{jesus.rodriguez, fredrik.rusek, muris.sarajlic, ove.edfors, liang.liu}@eit.lth.se

Index Terms: Massive MIMO, Stochastic Approximation, Gradient Descent, Recursive Least Squares, Decentralized, Detection, Zero-Forcing

I. INTRODUCTION

Massive multi-user (MU) multiple-input multiple-output (MIMO) is one of the most promising technologies in the wireless area [1]. High spectral efficiency and improved link reliability are among the key features of this technology, making it a key enabler to exploit spatial diversity far beyond traditional MIMO systems by employing a large-scale antenna array with hundreds or thousands of elements. This allows for unprecedented spatial resolution and high spectral efficiency, while providing simultaneous service to several users within the same time-frequency resource.

Despite all the advantages of Massive MIMO, there are challenges from an implementation point of view. Uplink detection algorithms like zero-forcing (ZF) typically rely on a centralized architecture, shown in Figure 1a, where baseband samples and channel state information (CSI) are collected in the central node for subsequent matrix inversion and detection. Dedicated links are needed between the antenna modules and the central node to carry this data. This approach, which is perfectly valid for a relatively low number of antennas, shows critical limitations when the array size increases, with interconnection bandwidth quickly becoming a bottleneck in the system.

Previous work has proposed different architectures for Massive MIMO base stations [2]–[6]. All of them conclude by pointing to the interconnection bandwidth as the main implementation bottleneck and a limiting factor for array scalability. Most of them recommend moving to a decentralized approach where uplink detection and downlink precoding can be performed locally in processing nodes close to the antennas. However, to achieve that, CSI still needs to be collected in a central node, where matrix inversion is done and the result is distributed back to all modules [2], [3], [5]. A further step has been made in [6], where CSI is obtained and used only locally (not shared) for precoding and detection. This architecture relies on a central node only for processing partial results. This dependency on a central node limits the scalability of the solution, as will be shown in Section IV.

In this paper we propose a fully decentralized architecture and recursive algorithms for Massive MIMO uplink detection. Antennas in the array are grouped into clusters. Apart from antennas, clusters contain RF, Analog-to-Digital Converter (ADC), OFDM receiver, channel estimation and detection blocks. The decentralized topology is based on the direct connection of clusters, forming a daisy-chain structure as shown in Figure 1b. The proposed algorithms are pipelined so that they run in a distributed way at the cluster nodes, providing a sequence of estimates that converge asymptotically to the zero-forcing solution. We make use of the following algorithms: Recursive Least Squares (RLS), Stochastic Gradient Descent (SGD) and Averaged Stochastic Gradient Descent (ASGD), which are detailed in Section III.

Decentralized architectures overcome bottlenecks by distributing the system requirements more evenly among the processing nodes of the system. Apart from this, data localization is a key characteristic of decentralized architectures. It allows data to be consumed as close as possible to where it is generated, minimizing the amount to transfer, and therefore saving bandwidth and energy. Following this idea, processing nodes need to be located near the antennas. Further, they perform tasks such as channel estimation and detection locally. Local CSI is estimated and stored locally in each node, without any need to share it with any other node in the system.

The remainder of the paper is organized as follows. The system model for the MIMO uplink is presented in Section II. In Section III we introduce the proposed algorithms. In Section IV we analyze the performance of these algorithms, present the advantages of the daisy-chain topology, and analyze interconnection data-rates. Finally, Section V presents the conclusions of this publication.

Notation: In this paper, lowercase, bold lowercase and bold uppercase letters stand for scalar, column vector and matrix, respectively. The operations $(\cdot)^T$, $(\cdot)^*$ and $(\cdot)^H$ denote transpose, conjugate and conjugate transpose, respectively. The vector $\mathbf{s}$ in the $n$th iteration is $\mathbf{s}_n$. Computational complexity is measured in terms of the number of complex-valued multiplications.

[Figure 1: block diagrams of the base station receiver chain. (a) Centralized architecture. (b) Decentralized architecture.]

Fig. 1. Comparison between base station receiver chain in centralized and fully decentralized architectures for Massive MIMO uplink. An antenna array with $M$ elements is divided into $C$ clusters, each containing $B$ antennas. (a): Centralized architecture. Clusters contain RF amplifiers and frequency down-conversion (RF) elements, analog-to-digital converters (ADC) and OFDM receivers. Each cluster has one link to transfer baseband samples to a central baseband processing node, where the rest of the processing tasks are done. (b): Fully decentralized architecture for detection. Clusters perform RF, ADC, OFDM, channel estimation (CHEST) and detection (DET) locally. Decoding (DEC) is centralized. Clusters are connected to each other by uni-directional links. Only one cluster has a direct connection with the central node. The proposed algorithms are executed in the DET blocks in parallel mode. The points where the interconnection data-rate is estimated are marked by circles and the value is denoted by $R$.

II. SYSTEM MODEL AND DETECTION ALGORITHMS

In this section we present the system model for the MIMO uplink and introduce the ZF equalizer.

We consider a scenario with $K$ single-antenna users transmitting to an antenna array with $M$ elements. The input-output relation for the uplink is

$$\mathbf{y} = \mathbf{H}\mathbf{s} + \mathbf{v}, \tag{1}$$

where $\mathbf{y}$ is the $M \times 1$ received vector, $\mathbf{s}$ is the $K \times 1$ transmitted user data vector, $\mathbf{H} = [\mathbf{h}_1\ \mathbf{h}_2\ \cdots\ \mathbf{h}_M]^T$ is the $M \times K$ channel matrix and $\mathbf{v}$ is the $M \times 1$ vector of noise samples. Under the Massive MIMO assumption, $M \gg K$.

Assuming time-frequency-based channel access, a Resource Element (RE) represents a slot in the time-frequency grid. Within each RE, the channel model follows (1).

A least-squares (LS) estimate of $\mathbf{s}$ is obtained as

$$\hat{\mathbf{s}}_{\mathrm{ZF}} = (\mathbf{H}^H\mathbf{H})^{-1}\mathbf{H}^H\mathbf{y}. \tag{2}$$

This method, commonly referred to as ZF, requires a centralized architecture as in Figure 1a because the complete matrix $\mathbf{H}$ must be collected in the central node before the Gramian matrix $\mathbf{H}^H\mathbf{H}$ and its inverse can be computed. Decentralized architectures, such as the one shown in Figure 1b, require other types of algorithms.

III. PROPOSED ALGORITHMS

In this section we propose three algorithms for decentralized MIMO uplink detection.

Depending on the situation, some algorithms are more appropriate than others. If full knowledge of $\mathbf{H}$ and $\mathbf{y}$ is available at a single node, direct methods such as ZF (2) can be applied. However, there are situations in which the cost of collecting all knowledge at a single node is too high. For those cases, a different approach has to be used.

The goal of the proposed algorithms for uplink detection is the estimation of the transmitted user data vector, $\mathbf{s}$ in (1), based on knowledge of $\mathbf{H}$ and $\mathbf{y}$ that is distributed among nodes. These algorithms provide a sequence of estimates, which converge to $\hat{\mathbf{s}}_{\mathrm{ZF}}$ as more knowledge of $\mathbf{H}$ and $\mathbf{y}$ is obtained. Estimation is done in a sequential manner, by which the estimate is passed from one antenna to the next, being updated every time based on the previous estimate ($\hat{\mathbf{s}}_{n-1}$), local CSI ($\mathbf{h}_n$) and the antenna observation ($y_n$). This can be summarized as $\hat{\mathbf{s}}_n = f(\hat{\mathbf{s}}_{n-1}, \mathbf{h}_n, y_n)$, which can be seen as a recursive form. This approach is in accordance with the data localization principle, which is a key characteristic of decentralized systems. In the Massive MIMO case, data is consumed close to where it is generated, namely at the antennas. As a result, neither $\mathbf{h}_n$ nor $y_n$ needs to be shared; only the estimate is.

These algorithms are flexible enough to work in clusters of antennas (see Figure 1b), whose size can vary from 1 up to $M$, the last case being equivalent to a centralized system.

The first recursive algorithm to be presented is the Recursive Least Squares (RLS) method, which is a recursive form of (2). Uplink detection can also be seen as a regression parameter estimation problem, which is well studied in the area of stochastic approximation methods. Stochastic Gradient Descent (SGD) and its averaged version (ASGD) fall within this group, and are based on a gradient descent algorithm in which the gradient is only partially known.

In Section III-A we present RLS applied to MIMO uplink detection, which provides approximate ZF performance at the expense of a preprocessing stage. Afterwards, we present the SGD algorithm and its enhanced version, the Averaged SGD (ASGD), which increases the robustness of SGD while achieving performance close to ZF for very large arrays.

Before we describe the algorithms we clarify the role played by the variable $B$, i.e., the number of antennas per cluster. Our algorithms are in fact independent of the value of $B$, therefore we present them with notation tailored to the choice $B = 1$. However, $B > 1$ is still of importance from an implementation point of view since each cluster may be implemented with a single processing unit. Thus, with $M = 100$ antennas, the choice $B = 1$ requires 100 processing units, while $B = 10$ requires only 10 such units. Nevertheless, the performance of our algorithms remains the same. $B$ therefore acts as a trade-off parameter: the larger $B$ is, the fewer processing units are needed, but the more centralized the architecture becomes.

A. Recursive Least-Squares (RLS)

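RLS is the recursive version of the LS algorithm. It can be shown [7] that the ZF/LS estimate, i.e., the l.h.s. of (2), can be approximated by RLS as $\hat{\mathbf{s}}_{\mathrm{ZF}} \approx \hat{\mathbf{s}}_M$, where $\hat{\mathbf{s}}_n$ is recursively found as follows:

$$\begin{aligned}
\varepsilon_n &= y_n - \mathbf{h}_n^T \hat{\mathbf{s}}_{n-1}, \\
\boldsymbol{\Gamma}_n &= \boldsymbol{\Gamma}_{n-1} - \frac{\boldsymbol{\Gamma}_{n-1}\mathbf{h}_n^* \mathbf{h}_n^T \boldsymbol{\Gamma}_{n-1}}{1 + \mathbf{h}_n^T \boldsymbol{\Gamma}_{n-1} \mathbf{h}_n^*}, \\
\hat{\mathbf{s}}_n &= \hat{\mathbf{s}}_{n-1} + \boldsymbol{\Gamma}_n \mathbf{h}_n^* \varepsilon_n.
\end{aligned} \tag{3}$$

The quality of the approximation depends on the initial value $\hat{\mathbf{s}}_0$. Nevertheless, for a randomly chosen $\hat{\mathbf{s}}_0$, the impact of $\hat{\mathbf{s}}_0$ quickly fades out over the index $n$ and it can be shown that $\hat{\mathbf{s}}_M \to \hat{\mathbf{s}}_{\mathrm{ZF}}$ as $M \to \infty$ with probability one. In (3), $\hat{\mathbf{s}}_n$ is a $K \times 1$ vector and is the output of cluster $n$, $y_n$ is the observation at the $n$th antenna, $\varepsilon_n$ is the prediction error and $\boldsymbol{\Gamma}_n$ is a $K \times K$ matrix. As a side comment, we remark that $\hat{\mathbf{s}}_n$ is an approximate LS solution up to the $n$th antenna element.

In view of Figure 1b, increasing the iteration number in (3) from $n$ to $n+1$ corresponds to passing on information from cluster $n$ to cluster $n+1$. Each cluster receives an estimate of the transmitted data vector from the previous cluster, $\hat{\mathbf{s}}_{n-1}$, and computes a new estimate $\hat{\mathbf{s}}_n$ based on local CSI, $\mathbf{h}_n$, and a local observation, $y_n$.

Under the block fading channel model, multiple Resource Elements (RE) in a certain region of the time-frequency grid experience identical channels. We name this region a Coherence Block (CB), and following this model it is possible to re-use the same CSI for all REs in the same CB.

Straightforward implementation of (3) at every RE is not efficient. In fact, a large share of the operations associated with (3) can be reused within the CB, namely those associated with the computation of $\boldsymbol{\Gamma}_n$. Defining $\boldsymbol{\Gamma}_0 = \mathbf{I}_K$ and

$$\begin{aligned}
\mathbf{z}_n &= \boldsymbol{\Gamma}_{n-1}\mathbf{h}_n^*, \\
\alpha_n &= \frac{1}{1 + \mathbf{h}_n^T \mathbf{z}_n}, \\
\boldsymbol{\Gamma}_n &= \boldsymbol{\Gamma}_{n-1} - \alpha_n \mathbf{z}_n \mathbf{z}_n^H, \qquad n = 1, 2, \ldots, M,
\end{aligned}$$

we see that at each RE it suffices to compute

$$\begin{aligned}
\varepsilon_n &= y_n - \mathbf{h}_n^T \hat{\mathbf{s}}_{n-1}, \\
\hat{\mathbf{s}}_n &= \hat{\mathbf{s}}_{n-1} + \alpha_n \mathbf{z}_n \varepsilon_n, \qquad n = 1, 2, \ldots, M,
\end{aligned}$$

in order to execute (3).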
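The split between per-coherence-block preprocessing and per-RE detection can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: the function names `rls_preprocess` and `rls_detect`, the zero initial estimate $\hat{\mathbf{s}}_0 = \mathbf{0}$, the i.i.d. Rayleigh channel, the QPSK symbols, the noise-free observations and the sizes $M = 256$, $K = 4$ are all assumptions chosen for illustration. In a real daisy chain each cluster would hold only its own $\alpha_n \mathbf{z}_n$; here the chain is simulated in a single loop.

```python
import random

def rls_preprocess(H):
    """Once per coherence block: return g_n = alpha_n * z_n for every antenna n."""
    K = len(H[0])
    # Gamma_0 = I_K
    Gamma = [[1 + 0j if i == j else 0j for j in range(K)] for i in range(K)]
    gains = []
    for h in H:
        # z_n = Gamma_{n-1} h_n^*
        z = [sum(Gamma[i][j] * h[j].conjugate() for j in range(K)) for i in range(K)]
        # alpha_n = 1 / (1 + h_n^T z_n)
        alpha = 1 / (1 + sum(h[i] * z[i] for i in range(K)))
        # Gamma_n = Gamma_{n-1} - alpha_n z_n z_n^H
        for i in range(K):
            for j in range(K):
                Gamma[i][j] -= alpha * z[i] * z[j].conjugate()
        gains.append([alpha * z_i for z_i in z])
    return gains

def rls_detect(H, y, gains):
    """Once per resource element: O(K) multiplications per antenna."""
    K = len(H[0])
    s_hat = [0j] * K  # assumed initial estimate s_0 = 0
    for h, y_n, g in zip(H, y, gains):
        eps = y_n - sum(h[i] * s_hat[i] for i in range(K))  # prediction error
        s_hat = [s_hat[i] + g[i] * eps for i in range(K)]   # s_n = s_{n-1} + alpha_n z_n eps_n
    return s_hat

def cgauss(std):
    """Circularly symmetric complex Gaussian sample with per-component std."""
    return complex(random.gauss(0, std), random.gauss(0, std))

random.seed(0)                      # arbitrary seed, for reproducibility only
M, K = 256, 4                       # illustrative sizes, not taken from the paper
H = [[cgauss(0.5 ** 0.5) for _ in range(K)] for _ in range(M)]
s = [complex(random.choice((-1, 1)), random.choice((-1, 1))) for _ in range(K)]
y = [sum(h[k] * s[k] for k in range(K)) for h in H]  # noise-free for clarity

gains = rls_preprocess(H)
s_hat = rls_detect(H, y, gains)
err = max(abs(a - b) for a, b in zip(s_hat, s))
print(f"max |s_hat - s| after {M} antennas: {err:.4f}")
```

With $\boldsymbol{\Gamma}_0 = \mathbf{I}_K$ the recursion carries a small regularization term, so even without noise the final estimate is close to, but not exactly equal to, the transmitted vector; the gap shrinks as $M$ grows.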
It is easily verified that the complexity of the preprocessing is $O(K^2)$, whilst the complexity is $O(K)$ at every RE.

B. Stochastic Gradient Descent (SGD)

The setup in SGD [8] is that one intends to solve the unconstrained LS problem

$$\min_{\mathbf{s}} \|\mathbf{y} - \mathbf{H}\mathbf{s}\|^2 \tag{4}$$

via a gradient descent (GD) approach. The gradient of (4) equals $\nabla_{\mathbf{s}} = \mathbf{H}^H\mathbf{H}\mathbf{s} - \mathbf{H}^H\mathbf{y}$. Since evaluating it requires the complete $\mathbf{H}$ and $\mathbf{y}$, an immediate consequence is that GD is only feasible in a centralized approach.

SGD is an approximate version that can be operated in a decentralized architecture. It does so by computing, at each cluster, as much as possible of $\nabla_{\mathbf{s}}$ with the information available at the cluster. Then the cluster updates the estimate $\hat{\mathbf{s}}$ using a scaled version of the "local" gradient and passes the updated estimate on to the next cluster.

The procedure described above can be stated formally as

$$\begin{aligned}
\varepsilon_n &= y_n - \mathbf{h}_n^T \hat{\mathbf{s}}_{n-1}, \\
\hat{\mathbf{s}}_n &= \hat{\mathbf{s}}_{n-1} + \mu_n \mathbf{h}_n^* \varepsilon_n,
\end{aligned} \tag{5}$$

where $\{\mu_n\}$ is a sequence of scalar step-sizes.

C. Averaged Stochastic Gradient Descent (ASGD)
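As a baseline for the averaging considered in this subsection, the plain SGD recursion (5) can be sketched in a few lines. This is an illustrative sketch only: the function name `sgd_chain`, the constant step-size $\mu_n = \mu = 0.02$, the zero initial estimate, the i.i.d. Rayleigh channel, the QPSK symbols, the noise-free observations and the sizes $M = 512$, $K = 4$ are assumptions, not values prescribed by the text.

```python
import random

def sgd_chain(H, y, mu):
    """Pass the estimate along the chain, one antenna (cluster with B = 1) at a time."""
    K = len(H[0])
    s_hat = [0j] * K  # assumed initial estimate s_0 = 0
    for h, y_n in zip(H, y):
        # eps_n = y_n - h_n^T s_{n-1}: uses only local CSI and the local observation
        eps = y_n - sum(h_k * s_k for h_k, s_k in zip(h, s_hat))
        # s_n = s_{n-1} + mu * h_n^* * eps_n
        s_hat = [s_k + mu * h_k.conjugate() * eps for h_k, s_k in zip(h, s_hat)]
    return s_hat

def cgauss(std):
    """Circularly symmetric complex Gaussian sample with per-component std."""
    return complex(random.gauss(0, std), random.gauss(0, std))

random.seed(1)                      # arbitrary seed, for reproducibility only
M, K, mu = 512, 4, 0.02             # illustrative values
H = [[cgauss(0.5 ** 0.5) for _ in range(K)] for _ in range(M)]
s = [complex(random.choice((-1, 1)), random.choice((-1, 1))) for _ in range(K)]
y = [sum(h[k] * s[k] for k in range(K)) for h in H]  # noise-free for clarity

s_hat = sgd_chain(H, y, mu)
err = max(abs(a - b) for a, b in zip(s_hat, s))
print(f"max |s_hat - s| after {M} antennas: {err:.2e}")
```

Each update costs on the order of $K$ complex multiplications and needs no matrix state, which is why only the $K \times 1$ estimate has to travel between clusters.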

Selection of optimum values of $\mu_n$ in SGD is not trivial. Even if we take $\mu_n = \mu$ for simplicity, the optimum value will depend on $M$, $K$ and the channel properties, where the latter may be unknown in many cases. An inappropriate selection of $\mu$ can lead to severe performance degradation depending on the scenario. Averaging an SGD sequence provides an asymptotically optimal convergence rate provided that the noise $\mathbf{v}$ is Gaussian [9], which increases robustness to the step-size selection. In the ASGD algorithm there are three sequences defined as follows:

$$\begin{aligned}
\varepsilon_n &= y_n - \mathbf{h}_n^T \hat{\mathbf{x}}_{n-1}, \\
\hat{\mathbf{x}}_n &= \hat{\mathbf{x}}_{n-1} + \mu_n \mathbf{h}_n^* \varepsilon_n, \\
\hat{\mathbf{s}}_n &= \begin{cases} \hat{\mathbf{x}}_n & \text{if } n < n_0, \\ \dfrac{1}{n - n_0 + 1} \displaystyle\sum_{k=n_0}^{n} \hat{\mathbf{x}}_k & \text{if } n \ge n_0, \end{cases}
\end{aligned} \tag{6}$$

where $\hat{\mathbf{x}}_n$ takes the role of the SGD output $\hat{\mathbf{s}}_n$ in (5). The ASGD output $\hat{\mathbf{s}}_n$ thereby becomes an averaged SGD sequence, where $n_0$ determines the onset of the averaging procedure. The averaged sequence can be written more conveniently as

$$\hat{\mathbf{s}}_n = \begin{cases} \hat{\mathbf{x}}_n & \text{if } n < n_0, \\ \hat{\mathbf{s}}_{n-1} + \dfrac{1}{n'}\left(\hat{\mathbf{x}}_n - \hat{\mathbf{s}}_{n-1}\right) & \text{if } n \ge n_0, \end{cases} \tag{7}$$

where $n' = n - n_0 + 1$. As will be seen in our numerical results, ASGD greatly relaxes the need for careful selection of $\mu$.

IV. ANALYSIS
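For reference in the analysis that follows, the ASGD recursion in (6) and its running-average form (7) can be sketched as below. This is a hedged illustration rather than the authors' code: the function name `asgd_chain`, the zero initial value $\hat{\mathbf{x}}_0 = \mathbf{0}$, the constant step-size, the noise variance, the averaging onset $n_0 = 512$ and the sizes $M = 1024$, $K = 4$ are all assumed for the example.

```python
import random

def asgd_chain(H, y, mu, n0):
    """Return (x_hat, s_hat): the SGD iterate and its running average from n0 on."""
    K = len(H[0])
    x_hat = [0j] * K       # assumed initial SGD iterate x_0 = 0
    s_hat = list(x_hat)
    for n, (h, y_n) in enumerate(zip(H, y), start=1):  # antenna index n = 1..M
        eps = y_n - sum(h[i] * x_hat[i] for i in range(K))
        x_hat = [x_hat[i] + mu * h[i].conjugate() * eps for i in range(K)]
        if n < n0:
            s_hat = list(x_hat)                # no averaging yet
        else:
            n_prime = n - n0 + 1               # number of averaged iterates so far
            s_hat = [s_hat[i] + (x_hat[i] - s_hat[i]) / n_prime for i in range(K)]
    return x_hat, s_hat

def cgauss(std):
    """Circularly symmetric complex Gaussian sample with per-component std."""
    return complex(random.gauss(0, std), random.gauss(0, std))

random.seed(2)                            # arbitrary seed, for reproducibility only
M, K, mu, n0 = 1024, 4, 0.02, 512         # illustrative values
noise_var = 0.1                           # assumed noise variance per antenna
H = [[cgauss(0.5 ** 0.5) for _ in range(K)] for _ in range(M)]
s = [complex(random.choice((-1, 1)), random.choice((-1, 1))) for _ in range(K)]
y = [sum(h[k] * s[k] for k in range(K)) + cgauss((noise_var / 2) ** 0.5) for h in H]

x_hat, s_hat = asgd_chain(H, y, mu, n0)
err_sgd = max(abs(a - b) for a, b in zip(x_hat, s))
err_asgd = max(abs(a - b) for a, b in zip(s_hat, s))
print(f"SGD  max error: {err_sgd:.4f}")
print(f"ASGD max error: {err_asgd:.4f}")
```

Note that a cluster needs both $\hat{\mathbf{x}}_{n-1}$ and $\hat{\mathbf{s}}_{n-1}$ from its predecessor once averaging has started, so two $K \times 1$ vectors are passed along the chain instead of one.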

In this section we analyze the proposed solution. First, the performance of the presented algorithms will be shown and compared with each other. Second, a few strong points of the daisy-chain topology are given. Finally, an analysis of interconnection bandwidth is presented, followed by a comparison for four different configurations.

A. Detection Performance

In this section, we present performance results for all algorithms. Reported metrics are Mean-Square-Error (MSE) and Bit-Error-Rate (BER) in block-faded Rayleigh channels. We report MSE, measured between $\hat{\mathbf{s}}$ and $\mathbf{s}$, as a function of the number of iterations (antenna index). The reported signal-to-noise ratio (SNR) is the average receive power at any base station antenna, divided by the noise variance.

MSE results for SGD are shown in Figure 2 for three different step-size values. As can be observed, the step-size plays a critical role in the convergence speed of the algorithm. High step-size values provide faster convergence but a high steady-state MSE, and low values may not even reach the steady state within the array. Given a certain $M$ and $K$, it is possible to find an optimum step-size which provides the lowest MSE.

[Figure 2: MSE (dB) versus antenna index; curves for step size = 0.001, 0.002 and 0.006.]

Fig. 2. MSE vs antenna index for three different step-size values in SGD.

We now turn our attention to Figures 3 and 4, which compare RLS, SGD, and ASGD. When the SGD sequence is averaged, the MSE and BER curves for different step-sizes get closer, making the algorithm robust against non-optimal step-size selection. The selection of $n_0$ also has an impact, but a smaller one than a non-optimal step-size has in SGD. As shown in Figure 4, RLS meets ZF (2) performance, as it is optimal for a Gaussian noise source [9]. For large $M$, the performance of RLS and ASGD converge due to ASGD's asymptotically optimal rate property.

B. Strengths of Daisy-Chain Topology

At first sight, it may seem that our daisy-chain solution incurs a latency penalty. This is, however, not the case, as the detection process over time and/or frequency can be pipelined. While cluster 2 is processing data at subcarrier, say, $f$, cluster 1 can process data at subcarrier $f+1$. In the next iteration, cluster 2 processes data at subcarrier $f+1$, etc. See Figure 5 for a graphical visualization of the pipelining procedure.

Further, our daisy-chain solution allows for a power saving: if a cluster $n$ regards its incoming estimate to be sufficiently good, then it can do one out of at least two things, 1) set $\hat{\mathbf{s}}_{n+1} = \hat{\mathbf{s}}_n$, or 2) send the incoming estimate $\hat{\mathbf{s}}_n$ to the baseband processing node, thereby terminating the detection procedure. The former has the advantage over the latter that only the last cluster needs to be connected to the baseband processing unit. Further, an indication of whether or not the incoming estimate is of sufficiently good quality can be obtained, e.g. for RLS, from the value $\varepsilon_n$ in (3).

Finally, our topology is flexible, so that additional antenna clusters can be added in a plug-and-play fashion. For example, in order to double the number of antennas, it is merely required to disconnect the cable between the last cluster and the baseband processing unit, connect that very cable to the last cluster of the added antenna array, and connect the two arrays. This will only impact software scheduling at the baseband processing unit, but not the hardware, as would have been the case for the centralized topology in Figure 1a.

C. Interconnection Data-Rate

In order to estimate the expected data-rate in the proposed architecture, we assume an OFDM-based frame structure based on slots. Each slot is made up of $N_{\mathrm{slot}}$ consecutive OFDM symbols with duration $T_{\mathrm{ofdm}}$. Each symbol contains $N_u$ subcarriers (an RE in OFDM) to carry user data. We can determine the average input/output data rate in the uplink for each of the clusters during a certain slot for SGD as follows:

$$\bar{R}_{\mathrm{SGD}} = \frac{K \cdot w_s \cdot N_u \cdot N_{\mathrm{UL}}}{T_{\mathrm{slot}}} = \frac{\alpha \cdot K \cdot w_s \cdot N_u}{T_{\mathrm{ofdm}}}, \tag{8}$$

where $T_{\mathrm{slot}}$ is the slot duration, $N_{\mathrm{UL}}$ is the number of OFDM symbols allocated for UL data in a slot, $w_s$ is the number of bits used to represent each element in the sequence of estimates ($\hat{\mathbf{s}}_n$) and $\alpha = N_{\mathrm{UL}} T_{\mathrm{ofdm}} / T_{\mathrm{slot}}$ represents the fraction of time spent in UL within the slot, so $0 \le \alpha \le 1$. In Figure 1b, $\bar{R}_{\mathrm{SGD}}$ corresponds to $R$.

This analysis does not take into account the total amount of data that is generated (which depends on $M$) and needs to move through the structure, but only the data that moves between clusters (which depends on $K$), because it is the latter that imposes physical constraints on the inter-cluster connections and may limit the scalability.
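As a worked instance of (8), the short sketch below evaluates both forms of the expression and checks that they agree. The numerical values ($T_{\mathrm{slot}} = 500\,\mu s$, $N_{\mathrm{slot}} = 7$, $N_{\mathrm{UL}} = 6$, $N_u = 1200$, $w_s = 16$, $K = 16$) are the ones the paper lists later for Table I; the function name `sgd_rate_bits_per_s` and the relation $T_{\mathrm{ofdm}} = T_{\mathrm{slot}} / N_{\mathrm{slot}}$ (slot fully occupied by its OFDM symbols) are assumptions of this example.

```python
def sgd_rate_bits_per_s(K, w_s, N_u, N_UL, T_slot):
    """First form of (8): bits exchanged per slot divided by the slot duration."""
    return K * w_s * N_u * N_UL / T_slot

# Parameter values quoted in the paper for Table I
T_slot = 500e-6   # slot duration in seconds
N_slot = 7        # OFDM symbols per slot
N_UL = 6          # OFDM symbols carrying uplink data per slot
N_u = 1200        # user-data subcarriers per OFDM symbol
w_s = 16          # bits per element of the estimate
K = 16            # number of users

rate = sgd_rate_bits_per_s(K, w_s, N_u, N_UL, T_slot)

# Second form of (8), assuming T_slot = N_slot * T_ofdm
T_ofdm = T_slot / N_slot
alpha = N_UL * T_ofdm / T_slot          # fraction of the slot spent in uplink
rate_alt = alpha * K * w_s * N_u / T_ofdm

print(f"alpha      = {alpha:.4f}")
print(f"R_SGD      = {rate / 1e9:.4f} Gbit/s")
print(f"R_SGD (alt) = {rate_alt / 1e9:.4f} Gbit/s")
```

The rate depends on $K$ but not on $M$, which is the property the scalability argument relies on.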

[Figure 3: MSE (dB) versus antenna index, two panels. Left panel curves: RLS, SGD and ASGD with step size = 0.04 and 0.02. Right panel curves: RLS, SGD and ASGD with step size = 0.008 and 0.004.]

Fig. 3. MSE vs antenna index for RLS, SGD and ASGD for different step-size values. Left: M=256. $n_0$=150 and 75 for µ=0.02 and 0.04 respectively. Right: M=2048. $n_0$=1000 and 400 for µ=0.004 and 0.008 respectively. K=16 and SNR=12dB in all cases.

[Figure 4: Bit Error Rate versus SNR (dB), two panels. Left panel curves: RLS, ASGD with step size = 0.04 and 0.02, ZF. Right panel curves: RLS, ASGD with step size = 0.008 and 0.004, ZF.]

Fig. 4. BER vs SNR for RLS, ASGD and ZF. Left: M=256, 16QAM. Right: M=2048, 64QAM. K=16 and SNR=12dB for both cases.

For ASGD, the average data rate is expected to be twice that of SGD, because for each sequence element two previous elements, $\hat{\mathbf{x}}_n$ and $\hat{\mathbf{s}}_{n-1}$, are needed, as can be observed in (7), and therefore

$$\bar{R}_{\mathrm{ASGD}} = 2 \cdot \bar{R}_{\mathrm{SGD}}. \tag{9}$$

For RLS, the data-rate has two components, one due to the preprocessing stage and the other due to the processing of each RE. During the first stage, the matrix $\boldsymbol{\Gamma}$ is passed from cluster to cluster. During the RE processing stage, the data rate is the same as in SGD. The average data rate for RLS is calculated as

$$\begin{aligned}
\bar{R}_{\mathrm{RLS}} &= \frac{N_{\mathrm{CB}} \cdot K^2 \cdot w_\gamma}{T_{\mathrm{slot}}} + \frac{K \cdot w_s \cdot N_u \cdot N_{\mathrm{UL}}}{T_{\mathrm{slot}}} \\
&= \frac{N_u \cdot N_{\mathrm{slot}}}{S_{\mathrm{CB}}} \cdot \frac{K^2 \cdot w_\gamma}{T_{\mathrm{ofdm}} \cdot N_{\mathrm{slot}}} + \frac{\alpha \cdot K \cdot w_s \cdot N_u}{T_{\mathrm{ofdm}}} \\
&= \frac{\alpha \cdot K \cdot w_s \cdot N_u}{T_{\mathrm{ofdm}}} \cdot \left(1 + \frac{\beta}{\alpha} \cdot \frac{K}{S_{\mathrm{CB}}}\right),
\end{aligned} \tag{10}$$

where $N_{\mathrm{CB}}$ is the number of CBs per slot, $S_{\mathrm{CB}}$ the number of REs in each CB, $w_\gamma$ the number of bits to represent each element in $\boldsymbol{\Gamma}$ and $\beta = w_\gamma / w_s$. From (10) it can be seen that $\bar{R}_{\mathrm{RLS}} > \bar{R}_{\mathrm{SGD}}$.

We can compare our proposed solution with another cluster-based decentralized architecture, which relies on a central node to collect partial results, perform a low-complexity operation such as averaging, and broadcast the result back to the clusters according to an iterative algorithm. This star topology has been proposed in [6]. In this case, the central node will have $C$ bi-directional links with an average aggregated data rate per direction of

$$\bar{R}_{\mathrm{star}} = C \cdot n_{\mathrm{iter}} \cdot \bar{R}_{\mathrm{SGD}}, \tag{11}$$

where $n_{\mathrm{iter}}$ is the number of iterations for the selected detection algorithm. From (11) we can observe that typically $\bar{R}_{\mathrm{star}} \gg \bar{R}_{\mathrm{SGD}}$.

In the case of a fully centralized architecture such as the one in [5], the interconnection data-rate depends linearly on $M$ as follows:

$$\bar{R}_{\mathrm{central}} = \frac{M \cdot N_u \cdot N_{\mathrm{UL}} \cdot w_{sc}}{T_{\mathrm{slot}}} = \frac{\alpha \cdot M \cdot N_u \cdot w_{sc}}{T_{\mathrm{ofdm}}}, \tag{12}$$

where $w_{sc}$ is the number of bits representing a sample of the received signal $\mathbf{y}$. It is seen that (12) does not scale easily, since it grows linearly with the array size. $\bar{R}_{\mathrm{central}}$ corresponds to $R$ in Figure 1a. Going from (12) to (8) reduces the data-rate roughly by a factor $M/K$ (typically much larger than one in Massive MIMO, since $M \gg K$).

Table I shows data-rates for four scenarios. We assume the following parameters: $T_{\mathrm{slot}} = 500\,\mu s$, $w_s = 16$, $w_{sc} = 24$, $N_u = 1200$, $N_{\mathrm{slot}} = 7$, $N_{\mathrm{UL}} = 6$, $\alpha = 6/7$, $\beta = 3/$ , $S_{\mathrm{CB}} = 400$ and $n_{\mathrm{iter}} = 3$. We can observe that the analyzed topology and algorithms achieve significantly lower interconnection data-rates than other architectures [5], [6], enabling future scalability. As observed, for very large arrays, RLS and ASGD require similar data-rates and have similar performance, but RLS requires a preprocessing stage and matrix manipulation that ASGD does not.

[Figure 5: time diagram over one slot (pilot and data OFDM symbols) showing PREP and MIMO activities for cluster 1 and cluster C, with start delay T.]

Fig. 5. Time diagram representing cluster activities during one uplink slot with 4 OFDM symbols: one pilot and three data symbols. Only cluster 1 and cluster C (the last one) are represented for simplicity. Two types of activities are shown per cluster. The first one represents the preprocessing stage (PREP), present only if RLS is used. The second one is the MIMO activity, which involves detection. The first cluster starts processing the first RE after complete reception of Data 1. Once this RE is processed, it is passed to the next cluster for further processing, and this is repeated successively through all clusters. As shown in the figure, there is a delay (T) in the starting time of cluster C compared to the first cluster. T needs to be small enough to meet latency constraints.

TABLE I
DATA RATE COMPARISON FOR DIFFERENT TOPOLOGIES / ALGORITHMS

M                    128        256        512        1024
K                    16         32         64         128
C B                  16         32         32         64
R̄_SGD
R̄_RLS
R̄_ASGD
R̄_star [6]          10.3GB/s   10.3GB/s   20.6GB/s   20.6GB/s
R̄_central [5]       5.1GB/s    10.2GB/s   20.4GB/s   40.8GB/s

V. CONCLUSIONS

In this article we have introduced a base station uplink architecture for Massive MIMO and analyzed the main implementation bottleneck, the interconnection data-rate. We have proposed three algorithms and a fully decentralized topology for uplink detection, which alleviate this limitation. One of the algorithms (RLS) achieves approximate zero-forcing performance, while another (ASGD) is an approximation which converges to the former for very large arrays. All of them are of low complexity and do not require matrix inversion. An estimate of the data-rate is also presented and compared with other architectures for different array sizes and configurations, showing the benefits of the proposed solution.

ACKNOWLEDGMENT

This work was supported by ELLIIT, the Excellence Center at Linköping-Lund in Information Technology.

REFERENCES

[1] T. L. Marzetta, “Noncooperative cellular wireless with unlimited numbers of base station antennas,” IEEE Transactions on Wireless Communications, vol. 9, no. 11, pp. 3590–3600, November 2010.
[2] C. Shepard et al., “Argos: Practical many-antenna base stations,” in Proceedings of the 18th Annual International Conference on Mobile Computing and Networking (Mobicom), New York, NY, USA, 2012, pp. 53–64. [Online]. Available: http://doi.acm.org/10.1145/2348543.2348553
[3] E. Bertilsson, O. Gustafsson, and E. G. Larsson, “A scalable architecture for massive MIMO base stations using distributed processing,” in , Nov 2016, pp. 864–868.
[4] A. Puglielli et al., “Design of energy- and cost-efficient massive MIMO arrays,” Proceedings of the IEEE, vol. 104, no. 3, pp. 586–606, March 2016.
[5] S. Malkowsky et al., “The world’s first real-time testbed for massive MIMO: Design, implementation, and validation,” IEEE Access, vol. 5, pp. 9073–9088, 2017.
[6] K. Li, R. R. Sharan, Y. Chen, T. Goldstein, J. R. Cavallaro, and C. Studer, “Decentralized baseband processing for massive MU-MIMO systems,” IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 7, no. 4, pp. 491–507, Dec 2017.
[7] L. Ljung and T. Söderström, Theory and Practice of Recursive Identification. The MIT Press, 1983.
[8] H. J. Kushner and G. G. Yin, Stochastic Approximation and Recursive Algorithms and Applications, 2nd ed. Springer, 2003.
[9] B. Polyak and A. B. Juditsky, “Acceleration of stochastic approximation by averaging,” SIAM Journal on Control and Optimization, vol. 30, pp. 838–855, July 1992.
[10] B. Polyak and Y. Z. Tsypkin, “Adaptive estimation algorithms (convergence, optimality, stability),”