Federated Edge Learning with Misaligned Over-The-Air Computation
Yulin Shao, Member, IEEE, Deniz Gündüz, Senior Member, IEEE, Soung Chang Liew, Fellow, IEEE
Abstract – Over-the-air computation (OAC) is a promising technique to realize fast model aggregation in the uplink of federated edge learning. OAC, however, hinges on accurate channel-gain precoding and strict synchronization among the edge devices, which are challenging in practice. As such, how to design the maximum likelihood (ML) estimator in the presence of residual channel-gain mismatch and asynchronies is an open problem. To fill this gap, this paper formulates the problem of misaligned OAC for federated edge learning and puts forth a whitened matched filtering and sampling scheme to obtain oversampled, but independent, samples from the misaligned and overlapped signals. Given the whitened samples, a sum-product ML estimator and an aligned-sample estimator are devised to estimate the arithmetic sum of the transmitted symbols. In particular, the computational complexity of our sum-product ML estimator is linear in the packet length, and hence significantly lower than that of the conventional ML estimator. Extensive simulations on the test accuracy versus the average received energy per symbol to noise power spectral density ratio (EsN0) yield two main results: 1) in the low-EsN0 regime, the aligned-sample estimator can achieve superior test accuracy provided that the phase misalignment is non-severe, whereas the ML estimator does not work well due to error propagation and noise enhancement in the estimation process; 2) in the high-EsN0 regime, the ML estimator attains the optimal learning performance regardless of the severity of phase misalignment, while the aligned-sample estimator suffers a test-accuracy loss caused by phase misalignment.
Index Terms – Federated edge learning, over-the-air computation, asynchronous, maximum likelihood estimation, sum-product algorithm.
I. INTRODUCTION
With the increasing adoption of Internet of Things (IoT) devices and services, an exponentially growing amount of data is collected at the wireless network edge. Increasingly complex machine learning (ML) models are trained and deployed to gather intelligence from the data collected by edge devices [1], [2]. While this is conventionally done at a cloud server [3], offloading huge amounts of edge data to centralized cloud servers is not sustainable and will potentially cause significant network congestion [4]. Moreover, data from edge devices
contain user-specific features, and centralized processing also causes privacy concerns. Federated learning (FL) has been proposed as an alternative distributed solution to enable collaborative on-device learning without sharing private training data [5]–[7].

Y. Shao was with the Department of Information Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong. He is now with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. (e-mail: [email protected]). D. Gündüz is with the Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. (e-mail: [email protected]). S. C. Liew is with the Department of Information Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong (e-mail: [email protected]). This work was supported by the European Research Council project BEACON under grant number 677854, and by CHIST-ERA grant CHIST-ERA-18-SDCDN-001 (funded by EPSRC-EP/T023600/1).

FL is an iterative distributed learning algorithm. In its basic implementation, orchestrated by a parameter server (PS), each iteration of FL consists of four main steps [6]: 1) Downlink (DL) broadcast – a PS maintains a global model and periodically broadcasts the latest global model to the edge devices; 2) Local training – upon receiving the latest global model, each edge device trains the model locally using its local dataset; 3) Uplink (UL) model aggregation – after training, all, or a subset, of the devices transmit their model updates back to the PS; 4) Global model update – the PS updates the global model using the model updates collected from the edge devices, typically by taking their average.

In the case of edge devices, often the devices that collaborate to learn a common model are within physical proximity of each other and are coordinated by a nearby access point, e.g., a base station acting as the PS. In this so-called federated edge learning (FEEL) scenario [4], the UL model aggregation step is particularly challenging, as the wireless medium is shared among all the participating devices. Traditional radio access network (RAN) technologies distribute channel resources among the devices by means of orthogonal multiple-access technologies [8] (e.g., TDMA, CDMA, OFDMA). However, such orthogonal resource allocation techniques significantly limit the quality of the model updates that can be sent from individual devices due to the limited channel resources that can be allocated to each device. We note that the number of real values to be transmitted by each device scales with the neural network size. For today's neural networks, this number can easily run into hundreds of millions or more [9] and hence is a heavy burden for the RANs.

Analog over-the-air computation (OAC) is a promising technique to realize uplink model aggregation in an efficient manner [10]–[18]. The basic idea of OAC is to create and leverage the inter-user interference in the multiple-access channel (MAC) to boost throughput. When operated with OAC, devices send their model updates in an uncoded fashion by directly mapping each model parameter to a channel symbol: each device first precodes the transmitted symbols by the inverse of the UL channel gain (assumed to be known to the transmitter in advance) and then transmits the precoded symbols to the receiver in an analog fashion. All the participating devices transmit simultaneously in the same communication link such that their signals overlap at the PS.
Provided that the channel-gain precoding and transmission timing are accurate, the fading MAC reduces to a Gaussian MAC, and the signal overlapped from the devices to the PS over the air naturally produces the arithmetic sum of the local model updates. Compared with traditional digital multiple-access schemes, wherein communication and computation constitute separate processes, OAC is a joint computation-and-communication scheme exploiting the fact that the MAC inherently yields an additive superposed signal.

The successful operation of OAC hinges on accurate channel-gain precoding and strict synchronization among the participating devices [11], [12]. In practice, however, both requirements may not be perfectly fulfilled. On the one hand, the channel-gain precoding at the edge devices can be imperfect due to inaccurate channel estimation and non-ideal hardware. The consequence is that there can be residual channel-gain mismatch in the overlapped signals. On the other hand, to meet the synchronization requirement, each device has to carefully calibrate its transmission timing – based on its distance from the PS and its moving speed – so that the signals overlap exactly with each other at the PS. This strict synchronization across different devices is very expensive to realize in practice, and there can be residual asynchronies among the signals from different devices.

With residual channel gains and residual asynchronies in the system, which we refer to as misaligned OAC, an open problem is how to estimate the arithmetic sum of the transmitted symbols from different devices [19]. This paper fills this gap and addresses the key problem in misaligned OAC of how to devise the maximum likelihood (ML) estimator in the face of channel-gain and time misalignments among the signals.

Our main contributions are as follows:

1) We formulate the problem of misaligned OAC for federated learning considering a time-domain realization of OAC. We put forth a whitened matched filtering and sampling scheme that yields oversampled, but independent, samples from the overlapped signals. An ML estimator for the arithmetic sum based on the whitened samples is devised.

2) To tackle the inter-symbol and inter-device interference in misaligned OAC, ML estimation requires the inversion of a large coefficient matrix and hence is computationally intensive. In view of this, we dissect the inner structure of the whitened samples and put forth a factor-graph-based ML estimator exploiting the sparsity of the coefficient matrix. This factor-graph estimator, dubbed the sum-product ML (SP-ML) estimator, interprets the compositions of the samples by a factor graph and computes the likelihood functions via an analog message passing process on the graph. With the SP-ML estimator, the computational complexity of ML estimation is significantly reduced from Ω(L² log L) to Ω(L) for a packet of length L.

3) We identify two main problems of ML estimation in misaligned OAC: error propagation and noise enhancement. As a result, ML estimation does not work well in the low average received energy per symbol to noise power spectral density ratio (EsN0) regime. To tackle this problem, we further put forth an aligned-sample estimator for misaligned OAC leveraging a subsequence of whitened samples wherein the symbols from different devices are "aligned", i.e., the indexes of the symbols from different devices are consistent in these samples. This estimator is shown to be a good alternative to the ML estimator in the low-EsN0 regime.
The complexity of the aligned-sample estimator is also linear in the packet length.

4) With the ML and aligned-sample estimators for misaligned OAC, we perform extensive simulations on the CIFAR dataset varying the degrees of time misalignment, phase misalignment, and EsN0. The learning performance is measured by the test accuracy, i.e., the accuracy achieved by the learned neural network on the test dataset. We find that: i) when there is no phase misalignment, the test accuracies of the ML estimator and the aligned-sample estimator are on the same footing for various degrees of time misalignment; ii) when there is phase misalignment, the ML estimator works only in the high-EsN0 regime, whereas the aligned-sample estimator works in both the low- and high-EsN0 regimes. Nevertheless, the aligned-sample estimator suffers from a loss in test accuracy due to phase misalignment: the larger the phase misalignment, the greater the test-accuracy loss. In the case of severe phase misalignment, the aligned-sample estimator leads to learning divergence even in the noiseless case. The ML estimator, on the other hand, does not incur such a test-accuracy loss even with severe phase misalignment; iii) when there is phase misalignment, the ML estimator benefits from time asynchronicity, while the aligned-sample estimator suffers from it; iv) overall, the aligned-sample estimator is preferred in the low-EsN0 regime and the ML estimator is preferred in the high-EsN0 regime.

Notations – We use boldface lowercase letters to denote column vectors (e.g., θ, s) and boldface uppercase letters to denote matrices (e.g., V, D). For a vector or matrix, (·)^⊤ denotes the transpose, (·)^* the complex conjugate, (·)^H the conjugate transpose, and (·)^† the Moore–Penrose pseudoinverse. R and C stand for the sets of real and complex numbers, respectively. (·)^r and (·)^i stand for the real and imaginary components of a complex symbol or vector, respectively. The imaginary unit is represented by j. N and CN stand for the real and complex Gaussian distributions, respectively. The cardinality of a set V is denoted by |V|. The sign function is denoted by sgn(·).

II. SYSTEM MODEL
Figure 1. In federated edge learning, the edge devices collaboratively train a shared model with the help of a parameter server.

We consider FEEL with the help of a wireless PS, where nearby edge devices with distinct local datasets collaborate over the shared wireless medium to train a common model, as shown in Fig. 1. The learning process goes through many iterations. Without loss of generality, let us focus on one of the iterations, wherein M active devices participate in the training. The iteration proceeds as follows [6]:
1) DL broadcast: at the beginning of the iteration, the PS broadcasts the global model θ ∈ R^d to the M edge devices;

2) Local training: each of the M devices trains the global model θ on its local dataset B_m of size B_m and obtains a new model θ̃_m ∈ R^d;

3) UL aggregation: each device scales the local model update θ̃_m − θ by B_m and transmits the scaled model update θ'_m = B_m(θ̃_m − θ) ∈ R^d back to the PS;

4) Arithmetic-sum estimation: the PS estimates the arithmetic sum of the transmitted model updates θ'_m from the edge devices:

θ_+ = Σ_{m=1}^{M} θ'_m;   (1)

5) Model update: the PS updates the global model by

θ_new = θ + θ_+ / (Σ_m B_m).   (2)

The updated global model θ_new is then broadcast in the next iteration, and the cycle continues.
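To make steps 3)–5) concrete, the following is a minimal NumPy sketch of one noiseless aggregation round, i.e., Eqs. (1) and (2) computed directly at the PS without the wireless channel. The function and variable names are illustrative and not from the paper's code release [33].

```python
import numpy as np

def fl_round(theta, local_models, dataset_sizes):
    """One noiseless FL round (steps 3-5 above).

    theta         -- global model, shape (d,)
    local_models  -- locally trained models theta_tilde_m, each shape (d,)
    dataset_sizes -- local dataset sizes B_m
    """
    # Step 3: each device scales its update theta_tilde_m - theta by B_m.
    updates = [B_m * (theta_m - theta)
               for theta_m, B_m in zip(local_models, dataset_sizes)]
    # Step 4: the PS forms the arithmetic sum, Eq. (1).
    theta_plus = np.sum(updates, axis=0)
    # Step 5: global model update, Eq. (2).
    return theta + theta_plus / sum(dataset_sizes)

# Toy usage with d = 4 and M = 2 devices: the result is the B_m-weighted
# average of the local models, as expected.
theta = np.zeros(4)
print(fl_round(theta, [np.ones(4), 2 * np.ones(4)], [100, 300]))  # -> 1.75
```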
Remark. To compute (2), the sum of the dataset sizes Σ_m B_m has to be known to the PS a priori. Thus, we let each device transmit its local dataset size B_m to the PS reliably in advance of the data transmission (in a digital way, with channel coding and automatic repeat request, for example).

Among the above five steps, the uplink model aggregation poses great challenges to the radio access network. In this step, each device has to transmit d real numbers, where d can run into hundreds of millions or more. In this paper, we consider analog over-the-air computation (OAC) to realize the uplink model aggregation.

When operated with OAC, the edge devices transmit the raw model updates θ'_m simultaneously to the PS in an analog manner (without digital modulation and channel coding). The PS, on the other hand, estimates the sum of the model updates θ_+ directly from the received overlapped signal.

The signal flow is detailed as follows. Each device partitions the sequence of scaled model updates θ'_m ∈ R^d into two subsequences θ'_m = [s^r_m, s^i_m]^⊤, where s^r_m, s^i_m ∈ R^{d/2}, and constructs a complex sequence s_m ∈ C^{d/2}: s_m = s^r_m + j s^i_m. That is, the raw model-update information θ'_m is carried on both the real and imaginary parts of s_m.

Time is divided into slots, and each device transmits a packet of L symbols in each slot. Therefore, to transmit the total of d/2 complex symbols, ⌈d/2L⌉ slots are needed. Without loss of generality, we focus on the signal processing in one slot.

The time-domain signal transmitted by the m-th device in one slot is given by

x_m(t) = α_m Σ_{ℓ=1}^{L} s_m[ℓ] p(t − ℓT),   (3)

where 1) p(t) = ½[sgn(t + T) − sgn(t)] is a rectangular pulse of duration T; 2) α_m is the channel-precoding factor. Given an estimated channel coefficient h̄_m at the m-th device, α_m is designed to be α_m = 1/h̄_m. Each of the M edge devices then carefully calibrates its transmission timing, based on its distance from the PS and its moving speed, so that the signals from different devices arrive at the fusion center simultaneously.

After passing through the fading multiple-access channel (MAC), the signals x_m(t), ∀m, overlap at the PS with relative time offsets. The received signal r(t) can be written as

r(t) = Σ_{m=1}^{M} h̃_m x_m(t − τ_m) + z(t),   (4)

where

1) h̃_m is the time-domain complex channel response. We consider flat-fading (frequency-nonselective) and slow-fading (time-nonselective) channels [20]. That is, for the channel between each device and the PS, the maximum delay spread is less than the symbol period T, so that there is only one resolvable path with channel response h̃_m at the receiver, and h̃_m is constant over one packet.
2) Without loss of generality, we sort the M devices so that the devices with smaller indexes arrive at the receiver earlier. The delay of the first device is set to τ_1 = 0, and the relative delay of the m-th device with respect to the first device is denoted by τ_m. We assume the time offsets τ_m, ∀m, are less than the symbol duration T, as shown in Fig. 2. In the ideal case where the timing calibrations are perfect, the relative delays among the packets are τ_m = 0, ∀m.

3) z(t) is zero-mean baseband complex AWGN; the double-sided power spectral density of each of its real and imaginary parts is N_0/2, for an aggregate of N_0.

Substituting (3) into (4) gives us

r(t) = Σ_{m=1}^{M} h̃_m α_m Σ_{ℓ=1}^{L} s_m[ℓ] p(t − τ_m − ℓT) + z(t) = Σ_{ℓ=1}^{L} Σ_{m=1}^{M} h'_m s_m[ℓ] p(t − τ_m − ℓT) + z(t),   (5)

where h'_m = h̃_m / h̄_m is the residual channel-fading coefficient between the m-th device and the PS.

In practice, the channel-gain precoding is limited by the maximum transmission power of the edge devices. In the case of deep fading, α_m would be very large, and we would have to clip α_m to satisfy the peak or average transmission-power constraint. In our formulation, this may be one cause of the residual channel gains at the receiver.

To ease exposition, the main body of this paper considers only channel-gain misalignment, time misalignment, and slow-fading channels. However, the system model can easily be generalized to OAC with residual carrier frequency offset (CFO) and fast-fading channels, as detailed in Appendix A.

Figure 2. In each slot, the transmitted packets from different devices overlap at the PS with channel misalignments and relative time offsets.

Succinctly speaking, there can be two kinds of misalignment among the signals transmitted from different devices: 1) channel-gain misalignment h'_m, caused by inaccurate channel-gain precoding; and 2) time misalignment τ_m, caused by inaccurate calibration of the transmission timing.

The objective of the PS is to estimate θ_+ = Σ_{m=1}^{M} θ'_m, i.e., the arithmetic sum of the local model updates. This is equivalent to estimating the arithmetic sum of the transmitted complex symbols s_+, where

s_+[i] = Σ_{m=1}^{M} s_m[i],

because the θ'_m are carried on the real and imaginary parts of the sequences s_m. In other words, given the estimated sequence ŝ_+[i], the estimated arithmetic sum of the local model updates is θ̂_+ = [ŝ^r_+[i], ŝ^i_+[i]]^⊤. Therefore, we shall focus on the estimation of the complex symbols s_+ in the following sections.
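As a quick illustration of the misaligned signal model (3)–(5), the sketch below synthesizes the overlapped waveform r(t) on an oversampled time grid with rectangular pulses, residual gains h'_m, and sub-symbol delays τ_m. All parameter values here are illustrative assumptions, not the paper's simulation settings.

```python
import numpy as np

rng = np.random.default_rng(0)
M, L, T, OS = 3, 8, 1.0, 100      # devices, packet length, symbol period, samples per T

s = rng.standard_normal((M, L)) + 1j * rng.standard_normal((M, L))  # symbols s_m[l]
h = np.exp(1j * rng.uniform(0, np.pi / 4, M))  # residual gains h'_m = h_tilde_m / h_bar_m
tau = np.sort(rng.uniform(0, T, M))
tau[0] = 0.0                                   # tau_1 = 0, all offsets < T

t = np.arange(0, (L + 2) * T, T / OS)          # time grid covering the whole packet
r = np.zeros(t.size, dtype=complex)
for m in range(M):
    for l in range(L):
        # Device m's l-th rectangular pulse of duration T, delayed by tau_m; this
        # is Eq. (5) with a 0-based symbol index.
        mask = (t >= l * T + tau[m]) & (t < (l + 1) * T + tau[m])
        r[mask] += h[m] * s[m, l]
# Complex AWGN z(t); the noise level here is arbitrary.
r += 0.05 * (rng.standard_normal(t.size) + 1j * rng.standard_normal(t.size))
```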
Remark. This paper formulates misaligned OAC for federated edge learning considering a time-domain realization of OAC. OAC can also be realized in the frequency domain via OFDM. The connections and differences between the two realizations are discussed in the conclusion.
III. ALIGNED AND MISALIGNED OAC
A. The Aligned OAC
Prior works on OAC considered only the perfectly aligned case [10], [11], [13], [14], [19], where there is neither channel-gain misalignment nor time misalignment, which we refer to as aligned OAC. The assumptions behind it are:

1) Accurate UL channel state information h̃_m is known to each device, and the end devices can precode the transmitted symbols by α_m = 1/h̃_m accurately. As a result, the channel gain h̃_m is perfectly canceled when the signal arrives at the PS, and the residual channel gain is h'_m = 1.

2) Each of the M devices transmits at the exact moment, so that the relative time offsets among devices are τ_m = 0.

As such, the received signal is

r(t) = Σ_{ℓ=1}^{L} Σ_{m=1}^{M} s_m[ℓ] p(t − ℓT) + z(t).   (6)

Matched filtering r(t) by the same rectangular pulse p(t) and sampling at t = iT, i = 1, 2, ..., L, give us

r[i] = (1/T) ∫_{(i−1)T}^{iT} r(t) dt = Σ_{m=1}^{M} s_m[i] + z[i] = s_+[i] + z[i],   (7)

where the noise sequence z[i] in the samples is independent and identically distributed (i.i.d.), z[i] ∼ CN(0, N_0/T).

As can be seen, the target signal s_+[i] appears explicitly on the RHS of (7). In this context, the fading MAC degenerates to a Gaussian MAC, and the M devices can be abstracted as a single device transmitting the arithmetic sum of the local model updates directly to the PS. In practice, however, the channel-gain precoding and the calibration of transmission timing can be inaccurate. With either channel-gain misalignment or time misalignment, samples as clean as (7), with s_+ explicitly present, are no longer available.
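In the aligned case, (7) is just an integrate-and-dump receiver: averaging the received waveform over each symbol period directly returns s_+[i] plus i.i.d. noise. Below is a minimal sketch of this step, assuming an oversampled waveform with OS samples per symbol period as in the earlier snippet; the function name is illustrative.

```python
import numpy as np

def aligned_oac_rx(r, OS):
    """Integrate-and-dump matched filtering and sampling, Eq. (7):
    average the waveform over each symbol period (OS samples per period),
    returning r[i] = s_plus[i] + z[i], one sample per symbol."""
    L = r.size // OS
    return r[: L * OS].reshape(L, OS).mean(axis=1)

# Toy check with M = 2 perfectly aligned devices and no noise: the samples
# reproduce s_plus exactly.
rng = np.random.default_rng(1)
OS, L = 100, 4
s = rng.standard_normal((2, L)) + 1j * rng.standard_normal((2, L))
r = np.repeat(s.sum(axis=0), OS)   # overlapped waveform with h'_m = 1, tau_m = 0
print(np.allclose(aligned_oac_rx(r, OS), s.sum(axis=0)))  # True
```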
B. The Misaligned OAC

With channel-gain and time misalignments, the received signal r(t) is given in (5) and illustrated in Fig. 2. Let us first follow the standard signal-processing flow in digital communications to process the received signal. Specifically, we first matched-filter r(t) by the rectangular pulse p(t) and then oversample the matched-filtered signal at {iT + τ_k : i = 1, 2, ..., L; k = 1, 2, ..., M} to collect sufficient statistics [21]. In so doing, the samples we get, denoted by {r_k[i] : k = 1, 2, ..., M; i = 1, 2, ..., L}, can be written as

r_k[i] = (1/T) ∫_{(i−1)T+τ_k}^{iT+τ_k} r(ζ) dζ   (8)
      = (1/T) Σ_{m=1}^{M} ∫_{(i−1)T+τ_k}^{(i−1_{m>k})T+τ_m} h'_m s_m[i − 1_{m>k}] dζ + (1/T) Σ_{m=1}^{M} ∫_{(i−1_{m>k})T+τ_m}^{iT+τ_k} h'_m s_m[i + 1_{m≤k}] dζ + z_k[i],

where 1_{(·)} denotes the indicator function and z_k[i] is the filtered noise. Stacking the samples into a vector r = [r_1[1], ..., r_M[1], r_1[2], ..., r_M[2], ..., r_1[L], ..., r_M[L]]^⊤, (8) can be written in matrix form as

r = A s + z,   (10)

where s = [s_1[1], ..., s_M[1], ..., s_1[L], ..., s_M[L]]^⊤ is the sequence of transmitted symbols, and the entries of the ML × ML coefficient matrix A are the overlap coefficients determined by the residual channel gains h'_m and the time offsets τ_m; each row of A contains 2M − 1 non-zero entries. The desired sequence at the PS is

s_+ = V s,   (11)

where the L × ML matrix V sums, for each symbol index, the M symbols from the different devices. The ML estimator of s_+ from r is

ŝ_+^ml = V (A^H Σ_z^{-1} A)^{-1} A^H Σ_z^{-1} r,   (13)

where Σ_z is the covariance matrix of the noise vector z, which is colored because the integration windows of adjacent samples overlap. That is, estimating s_+ via (13) requires first estimating the full symbol vector s: the ML estimation of s_+ boils down to multi-user detection (MUD) followed by summation.

Remark.
The transmitted vector s carries the local weight updates of a neural network. Therefore, s is by no means i.i.d., considering the strong correlations among the weights of the neural network. This also suggests that prior information about s is hard to come by and is unlikely to be known to the PS in advance. As a result, ML estimation is the only choice at the receiver.
Remark. The result that the ML estimation of s_+ boils down to MUD is exclusive to the misaligned OAC system, wherein the transmitted symbols are continuous-valued and the channel gains are misaligned. In digital communications, we do not have such a result for ML estimation. The reason for this divergence is as follows. OAC is an analog communication system wherein the transmitted symbols s are continuous complex values. In contrast, the transmitted symbols in digital communications are discrete constellations. Whether or not the prior probability distribution of the transmitted symbols is available to the receiver, the constellation itself is a kind of prior information, because the detection space is naturally narrowed down to the possible constellation points only. As a result, if we perform ML estimation in digital communications, the subtext is that all the constellation points are equiprobable. In that case, the likelihood function f(r | s_+) is a Gaussian mixture instead of a Gaussian, and MUD-and-sum estimation is no longer optimal if we were to estimate the arithmetic sum s_+. On the other hand, when we perform ML estimation in OAC, the whole complex space is assumed to be equiprobable. The only information we have is the noise-contaminated sample, and the likelihood function is a Gaussian centered around the noisy sample. Two implications for ML estimation in OAC are thus: 1) it faces an infinitely large estimation space; 2) it can be very susceptible to noise.

The ML estimator in (13) is not a practical estimator due to the prohibitive computational complexity of matrix inversion. To invert an n × n matrix, the best proven lower bound on the computational complexity is Ω(n² log n) [22]. Notice that the dimensionality of A is ML × ML. Thus, the computational complexity of (13) is Ω(L²M² log(LM)). In practical OAC systems, M cannot be too large due to the saturation of the receiver (that is, the received signal power can exceed the dynamic range of the receiver if M is too large), but the packet length L can be extremely large. If we fix M as a constant, the computational complexity of (13) is then Ω(L² log L). To address this problem and devise an ML estimator with acceptable computational complexity, we put forth in Section IV a factor-graph-based ML estimator that exploits the sparsity of the coefficient matrix. Compared with the ML estimator in (13), the computational complexity of the factor-graph-based ML estimator is only Ω(L).
IV. A SUM-PRODUCT MAXIMUM LIKELIHOOD ESTIMATOR AND THE ALIGNED-SAMPLE ESTIMATOR

Before we dive deeper to dissect the inner structure of the coefficient matrix, let us first introduce a new matched filtering and sampling scheme that gives us oversampled but independent samples, which we refer to as "whitened matched filtering and sampling". Two benefits of the whitened matched filtering and sampling scheme are: 1) the independent samples obtained from the whitened scheme allow us to construct a factor graph with a simple structure, based on which a low-complexity sum-product ML (SP-ML) estimator can be devised; 2) the whitened scheme yields a sequence of samples in which the indexes of the symbols from different devices are consistent. This admits an aligned-sample estimator in the misaligned OAC.
A. Whitened Matched Filtering and Sampling
The key idea of the whitened matched filtering and sampling scheme is to use a bank of matched filters of different lengths to collect power judiciously from r(t). Specifically, instead of using the rectangular pulse p(t) as the matched filter as in (8), we define M matched filters {p'_k(t) : k = 1, 2, ..., M} as follows:

p'_k(t) = ½[sgn(t + T) − sgn(t + T − d_k)],   (14)

where the length of the k-th matched filter is d_k = τ_{k+1} − τ_k, k = 1, 2, ..., M. For completeness, we define τ_{M+1} = T.

Figure 3. Matched filtering the received signal by a bank of M filters of lengths d_k = τ_{k+1} − τ_k.

The matched filtering and sampling process is illustrated in Fig. 3. As shown, the signal filtered by the k-th matched filter is given by

y_k(t) = (1/d_k) ∫_{−∞}^{∞} r(ζ) p'_k(t − ζ) dζ,

and we sample y_k(t) at (i−1)T + τ_{k+1}, i = 1, 2, ..., L, L+1, giving

y_k[i] = y_k(t = (i−1)T + τ_{k+1}) = (1/d_k) ∫_{(i−1)T+τ_k}^{(i−1)T+τ_{k+1}} Σ_{m=1}^{M} h'_m s_m[i − 1_{m>k}] dζ + (1/d_k) ∫_{(i−1)T+τ_k}^{(i−1)T+τ_{k+1}} z(ζ) dζ ≜ Σ_{m=1}^{M} h'_m s_m[i − 1_{m>k}] + z̃_k[i],   (15)

where we have defined s_m[0] = s_m[L+1] = 0, ∀m, for completeness.

An important observation from (15) is that the noise term z̃_k[i] ∼ CN(0, N_0/d_k) and is independent for different k and i. This is because

E[z̃_k[i] z̃*_{k'}[i']] = (1/(d_k d_{k'})) ∫∫ E[z(ζ) z*(ζ')] dζ dζ' = (N_0/d_k) δ[i − i'] δ[k − k'],   (16)

since E[z(ζ) z*(ζ')] = N_0 δ(ζ − ζ') and the integration intervals of different (k, i) pairs are disjoint.

Similar to (10), we can rewrite (15) in matrix form as

y = D s + z̃,   (17)

where the sequence of transmitted symbols s is the same as that in (10). Unlike (10), the vectors y and z̃ in (17) are (M(L+1) − 1)-dimensional:

y = [y_1[1], ..., y_M[1], y_1[2], ..., y_M[2], ..., y_1[L], ..., y_M[L], y_1[L+1], ..., y_{M−1}[L+1]]^⊤,
z̃ = [z̃_1[1], ..., z̃_M[1], z̃_1[2], ..., z̃_M[2], ..., z̃_1[L], ..., z̃_M[L], z̃_1[L+1], ..., z̃_{M−1}[L+1]]^⊤,

and the coefficient matrix D is (M(L+1) − 1) × ML and banded: the row corresponding to y_k[i] contains h'_1, ..., h'_k in the columns of s_1[i], ..., s_k[i], and h'_{k+1}, ..., h'_M in the columns of s_{k+1}[i−1], ..., s_M[i−1]. When there is residual CFO and the channel is fast-fading, D can be written in the same form, as given in (32). The desired sequence at the PS is s_+ = V s, as in (11).

Eq. (16) validates that the noise sequence z̃ is white. In this light, (17) can be viewed as a whitened model of (10), and hence we call this oversampled matched filtering and sampling scheme "whitened" matched filtering and sampling.

We emphasize that the signal model (17) is equivalent to (10), since both are sampled from the same received signal r(t) and no information is lost. More specifically, (17) can be transformed back to (10) after some elementary row operations and row deletions. Nevertheless, the signal model (17) is more favorable than (10) thanks to:

1) The whitened noise.
As will be shown later, white noise admits a sample-by-sample factorization of the likelihood function and a much simpler structure of the factor graph.

2) The alleviated inter-symbol and inter-user interference. A sample y_k[i] is related to only M complex symbols, each of which comes from a different device. In contrast, a sample r_k[i] in (10) is related to 2M − 1 symbols.
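The banded structure of D in (17), and the whitened-model ML estimate given in the remark below as Eq. (18), can be sketched in a few lines of NumPy. This is a direct dense implementation for small M and L, intended only to make the indexing i − 1_{m>k} and the weighted least-squares step explicit; it is not the low-complexity estimator developed next, and the names build_D, ml_estimate_s_plus, and noise_prec are illustrative.

```python
import numpy as np

def build_D(h, L):
    """Banded coefficient matrix D of the whitened model y = D s + z~, Eq. (17).
    The row of y_k[i] carries h'_m on s_m[i] for m <= k and on s_m[i-1] for
    m > k; the boundary symbols s_m[0] and s_m[L+1] are zero."""
    M = len(h)
    D = np.zeros((M * (L + 1) - 1, M * L), dtype=complex)
    for i in range(1, L + 2):                # sample index i
        for k in range(1, M + 1):            # matched-filter index k
            if i == L + 1 and k == M:
                break                        # the last slot has only M - 1 samples
            row = (i - 1) * M + (k - 1)
            for m in range(1, M + 1):
                j = i - (1 if m > k else 0)  # symbol index of device m in y_k[i]
                if 1 <= j <= L:
                    D[row, (j - 1) * M + (m - 1)] = h[m - 1]
    return D

def ml_estimate_s_plus(y, D, noise_prec, M, L):
    """Whitened-model ML estimate, Eq. (18): weighted least squares with the
    diagonal precision of z~ (noise_prec[n] = d_k / N0 per sample), then the
    per-index sum V s."""
    Dw = D.conj().T * noise_prec             # D^H Sigma^{-1}
    s_hat = np.linalg.solve(Dw @ D, Dw @ y)
    return s_hat.reshape(L, M).sum(axis=1)   # s_plus[i] = sum_m s_m[i]
```

For generic residual gains, D has full column rank, so the weighted least-squares problem has a unique solution; the cost of the dense solve is exactly what the factor-graph estimator developed below avoids.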
Remark. Since (10) and (17) are equivalent, we can also design the ML estimator from (17). Denoting the covariance matrix of z̃ by Σ_z̃ (a diagonal matrix, since z̃ is white), the ML estimator is given by

ŝ_+^ml = V (D^H Σ_z̃^{-1} D)^{-1} D^H Σ_z̃^{-1} y.   (18)

Again, the inversion of D^H Σ_z̃^{-1} D is computationally intensive, as in (13).

B. A Factor Graph Approach

A possible way to reduce the complexity of ML estimation is to exploit the sparsity of the coefficient matrix D. To this end, let us focus on the ML estimate of a single entry of the desired sequence s_+, i.e., ŝ_+^ml[i].

Let s[i] = [s_1[i], s_2[i], ..., s_M[i]]^⊤. Given an observed sample sequence y, we have

ŝ_+^ml[i] = argmax_{s_+[i]} f(y | s_+[i]) = argmax_{s_+[i]} ∫_{1^⊤ s[i] = s_+[i]} f(y | s[i]) d s[i].

In particular, f(y | s[i]) is a marginal function of f(y | s). Thus, to find the ML estimate ŝ_+^ml[i], a first step is to analyze f(y | s). In the following analysis, we call f(y | s) the global likelihood function and f(y | s[i]) the marginal likelihood function.

In ML estimation, the transmitted symbols s are treated as constants. Randomness is introduced only by the noise sequence z̃. Therefore, in the whitened model (17), the elements of y are independent of each other. We can then factorize the likelihood function f(y | s) as

f(y | s) ∝ Π_{k=1}^{M} Π_{i=1}^{L+1} f(y_k[i] | s) = Π_{k=1}^{M} Π_{i=1}^{L+1} f(y_k[i] | V(y_k[i])),   (19)

where the second equality follows because a sample y_k[i] is related to only M complex symbols in s. As per (15), we denote these symbols by V(y_k[i]) = {s_1[i], s_2[i], ..., s_k[i], s_{k+1}[i−1], s_{k+2}[i−1], ..., s_M[i−1]} and call them the neighbor symbols of y_k[i]. A sample y_k[i] is then fully determined by the values of its neighbor symbols. Note that the number of non-zero symbols in V(y_k[i]) equals the number of non-zero elements in the corresponding row of D:

|V(y_k[i])| = k, when i = 1; M, when 2 ≤ i ≤ L; M − k, when i = L + 1.   (20)

Based on the factorization in (19), f(y | s) can be depicted by a graphical model [23]–[25]. As shown in Fig. 4, we use a Forney-style factor graph [24] to represent the factorization. In particular, each edge in the graph corresponds to a variable in (19), i.e., a complex symbol s_m[ℓ], an observation y_k[i], or a noise term z̃_k[i]. To simplify notation, we denote them by s_{m,i}, y_{k,i}, and z̃_{k,i} in Fig. 4, respectively.

As can be seen, each sample y_k[i] is related to a set of symbols V(y_k[i]); each complex symbol s_k[i], on the other hand, is related to M samples. Thus, we duplicate each symbol s_{m,i} M times and connect the copies to M consecutive samples – the equality constraint function "=" means that the values of the variables connecting to this function must be equal. The output degree of each equality constraint function is M, as is the input degree of each plus function "+" (except for the M − 1 samples at both ends of the packet). The target symbols s_+ are plotted at the top of Fig. 4.

The marginal likelihood function f(y | s[i]) is a marginal function of the global likelihood function f(y | s). Thus, it can be obtained by a marginalization process operated on the factor graph, which can be implemented efficiently via the sum-product algorithm. However, a problem is that Fig. 4 is a loopy graph.
Even worse, the girth of the graph (the length of the shortest loop) is small. Such short loops prevent the sum-product algorithm from converging [26]. Even if it converges, the performance of the sum-product algorithm often degrades greatly, and the equilibrium posterior distributions are only approximations of the true posterior distributions of the variables. Said another way, the independence assumption on the extrinsic information passed along the edges no longer holds, because messages can circulate indefinitely around the loops.

To circumvent this problem, below we transform the loopy graph in Fig. 4 into a loop-free graph, at the expense of increasing the dimension of the variables [23].

For each sample y_k[i], we cluster all its neighbor symbols (i.e., the edges/variables that connect to y_k[i] in Fig. 4) and construct a new higher-dimensional variable, denoted by W_{k,i} = V(y_k[i]). As illustrated in Fig. 5, each sample y_k[i] is connected to a single high-dimensional variable W_{k,i} after clustering, and the loops are removed.

After clustering, the new high-dimensional variables W_{k,i} are correlated with each other because they contain common symbols. For example, W_{1,1} = {s_1[1]}, W_{2,1} = {s_1[1], s_2[1]}, and W_{3,1} = {s_1[1], s_2[1], s_3[1]}; the common symbol between W_{1,1} and W_{2,1} is s_1[1], that between W_{1,1} and W_{3,1} is s_1[1], and those between W_{2,1} and W_{3,1} are s_1[1], s_2[1]. Therefore, we have to add constraints among the new high-dimensional variables to ensure that the values of the common symbols are consistent across variables.

In Fig. 5, the compatibility function δ is added on adjacent variables to represent the above constraints. Specifically, for two adjacent variables W and W' connected to the same delta function, the compatibility function δ(W, W') is defined as

δ(W, W') = 1, if the values of all common symbols between W and W' are equal; 0, otherwise.

That is, δ is an on-off function ensuring that the messages passed from W to W' and from W' to W satisfy the constraint that the values of the common symbols between W and W' are equal.
Remark. It is worth noting that adding compatibility functions between two adjacent variables is enough to depict all the constraints. For example, let us consider W_{1,1}, W_{2,1}, and W_{3,1}. In Fig. 5, we only add the compatibility functions δ(W_{1,1}, W_{2,1}) and δ(W_{2,1}, W_{3,1}). There is no need to add an extra compatibility function δ(W_{1,1}, W_{3,1}) between W_{1,1} and W_{3,1}, although they have a common symbol s_1[1]. This is because W_{1,1} and W_{3,1} are independent conditioned on the compatibility functions δ(W_{1,1}, W_{2,1}) and δ(W_{2,1}, W_{3,1}). More specifically, δ(W_{1,1}, W_{2,1}) has ensured that the value of s_1[1] in W_{1,1} equals that in W_{2,1}, and δ(W_{2,1}, W_{3,1}) has ensured that the values of s_1[1], s_2[1] in W_{2,1} equal those in W_{3,1}. Thus, the value of s_1[1] in W_{1,1} and W_{3,1} must be equal.

Figure 4. A graphical interpretation of the factorization in (19). To simplify notation, we denote y_k[i], z̃_k[i], and s_m[i] by y_{k,i}, z̃_{k,i}, and s_{m,i} in the figure, respectively.

Figure 5. An equivalent tree structure to the loopy graph in Fig. 4. The variables connected to the same sample are clustered as a new high-dimensional variable. Each observation node is connected to a single variable node after clustering.

Overall, the factor graph in Fig. 5 presents a tree structure. Compared with Fig. 4, although the dimensions of the variables are M times larger, the marginal likelihood function f(y | s[i]) can now be computed exactly via the sum-product algorithm thanks to the tree structure.
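The neighbor sets V(y_k[i]) – and hence the clustered variables W_{k,i} and the common symbols that the δ functions must keep consistent – follow mechanically from the index rule in (15). A small self-contained helper, with illustrative names:

```python
def neighbors(k, i, M, L):
    """Neighbor symbols V(y_k[i]) of sample y_k[i], per Eq. (15): device m
    contributes s_m[i] if m <= k and s_m[i-1] if m > k. Returns (m, index)
    pairs, dropping the all-zero boundary symbols s_m[0] and s_m[L+1]."""
    pairs = [(m, i - (1 if m > k else 0)) for m in range(1, M + 1)]
    return [(m, j) for (m, j) in pairs if 1 <= j <= L]

M, L = 4, 6
# Degrees match Eq. (20): k at i = 1, M for 2 <= i <= L, M - k at i = L + 1.
assert len(neighbors(2, 1, M, L)) == 2
assert len(neighbors(2, 3, M, L)) == M
assert len(neighbors(1, L + 1, M, L)) == M - 1

# Common symbols of adjacent clustered variables W_{1,1} and W_{2,1}:
print(set(neighbors(1, 1, M, L)) & set(neighbors(2, 1, M, L)))  # {(1, 1)}, i.e., s_1[1]
```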
C. Analog Message Passing and the SP-ML Estimator

In standard sum-product algorithms, the messages passed on the edges are the probability mass functions (PMFs) of the variables associated with the edges, i.e., probability vectors of finite length [23], [25]. This is because the transmitted symbols in digital communications are QAM constellations with finite alphabet size. In the case of OAC, however, the transmitted symbols s_m are continuous complex numbers. The messages passed on the edges should then be the probability density functions (PDFs) of the associated variables, which are continuous functions rather than finite-length vectors.

To enable message passing, a straightforward idea is to quantize the PDF so that we can follow the digital message passing. The output of the SPA is then the marginal PDF in quantized form. However, quantization suffers from the "curse of dimensionality" – when the dimension of the variables increases, the volume of the space grows very fast. To get a sound result, a significantly larger number of quantization levels is required in each dimension compared with the low-dimensional case [27]. In our problem, we have deliberately increased the dimensionality of the variables to remove the loops in Fig. 4. Thus, the standard sum-product algorithm with quantization cannot be used, considering its prohibitive complexity.

Now that quantization is not an option, the problem is how to pass the continuous PDFs on the edges of Fig. 5. A natural idea is to parameterize the PDF and pass the parameters instead of the PDF itself [24], [28]. In actuality, the main effort of this section is to show that, given an observed sample sequence, all the messages passed on the tree are multivariate Gaussian distributions. A multivariate Gaussian PDF can be parameterized by its mean vector and covariance matrix – passing these parameters is equivalent to passing the continuous PDF on the edges.

To ease reading, we present the detailed derivations of the analog message passing in Appendix B and summarize the main results below.

1) The marginal likelihood function f(y | s[i]) is a multivariate complex Gaussian distribution of dimensionality M. The likelihood function f(y | s_+[i]) is a single-variate complex Gaussian distribution.

2) A complex random variable can be viewed as a pair of real random variables (its real and imaginary parts). Thus, we can denote the M-dimensional complex Gaussian f(y | s[i]) by a 2M-dimensional real Gaussian

f(y | s[i]) ∼ N( s[i]; µ_{s[i]} = [µ^r_{s[i]}; µ^i_{s[i]}], Σ_{s[i]} = [[Σ^rr_{s[i]}, Σ^ri_{s[i]}]; [Σ^ir_{s[i]}, Σ^ii_{s[i]}]] ),   (21)

where µ_{s[i]} is a 2M × 1 real vector consisting of the real and imaginary parts of the mean of s[i]; that is, µ^r_{s[i]} and µ^i_{s[i]} are the real and imaginary parts of E[s[i]]^⊤. The covariance matrix Σ_{s[i]} is 2M × 2M.
The moment parameters (µ_{s[i]}, Σ_{s[i]}) can be computed by the sum-product process described in Algorithm 1 (see the detailed derivations in Appendix B).

3) Given the marginal likelihood function f(y | s[i]) in (21), f(y | s_+[i]) can be constructed as

f(y | s_+[i]) ∼ N( s_+[i]; µ_{s_+[i]} = [µ^r_{s_+[i]}; µ^i_{s_+[i]}], Σ_{s_+[i]} = [[Σ^rr_{s_+[i]}, Σ^ri_{s_+[i]}]; [Σ^ir_{s_+[i]}, Σ^ii_{s_+[i]}]] ),   (22)

where µ^r_{s_+[i]} = 1^⊤ µ^r_{s[i]}, µ^i_{s_+[i]} = 1^⊤ µ^i_{s[i]}, Σ^rr_{s_+[i]} = 1^⊤ Σ^rr_{s[i]} 1, Σ^ri_{s_+[i]} = 1^⊤ Σ^ri_{s[i]} 1, Σ^ir_{s_+[i]} = 1^⊤ Σ^ir_{s[i]} 1, and Σ^ii_{s_+[i]} = 1^⊤ Σ^ii_{s[i]} 1.
4) Following (22), an SP-ML estimator can be designed, as given in Definition 2.
Definition 2 (SP-ML estimation). The whitened matched filtering and sampling scheme gives us the whitened samples y in (17). To estimate the desired sequence s_+, an SP-ML estimator first computes the moment parameters of the multivariate Gaussian f(y | s[i]), ∀i, by the analog sum-product process given in Algorithm 1, and then estimates each element of s_+ by

ŝ_+^ml[i] = 1^⊤ µ^r_{s[i]} + j 1^⊤ µ^i_{s[i]},   (23)

where µ_{s[i]} = [µ^r_{s[i]}, µ^i_{s[i]}]^⊤ is the mean of f(y | s[i]).

The reason behind (23) is as follows. It has been shown that f(y | s_+[i]) is conditionally Gaussian. As per the ML rule, we should choose the mean of f(y | s_+[i]) as the estimate of s_+[i], as it maximizes the likelihood function. This gives us (23).

Algorithm 1: Analog message passing for ML estimation.
Input: Samples y and coefficient matrix D.
Output: The marginal likelihood functions f(y | s[i]).
  for k = 1, 2, ..., M and i = 1, 2, ..., L + 1 do
    W_{k,i} = V(y_k[i]);
    Compute the information about W_{k,i} contained in each sample y_k[i], i.e., f_b(W_{k,i}), following (35).
  for i = 1, 2, ..., L + 1 do
    for k = 1, 2, ..., M do
      Compute the information about W_{k,i} contained in all the samples {y_{k'}[i'] : k' < k, i' < i}, i.e., f_ℓ(W_{k,i}), following (41) and (42).
  for i = L + 1, L, ..., 2, 1 do
    for k = M, M − 1, ..., 2, 1 do
      Compute the information about W_{k,i} contained in all the samples {y_{k'}[i'] : k' > k, i' > i}, i.e., f_r(W_{k,i}), following (43) and (44).
  for i = 1, 2, ..., L do
    Compute the marginal likelihood function f(y | s[i]), as per (46).

Remark (MLSE versus BCJR). The ML estimators in (13) and (18) aim to find the maximum-likelihood sequence s_+ in the space C^L. In the language of digital communications, they are ML-optimal in the sense of maximum likelihood sequence estimation (MLSE) [29]. On the other hand, the ML estimator in (23) aims to maximize the likelihood function of each element of s_+. Thus, it is ML-optimal in the BCJR [30] sense.

When we perform ML estimation in digital communications, MLSE-optimal and BCJR-optimal are different criteria, because the former minimizes the block error rate (BLER) and the latter minimizes the bit error rate (BER). For ML estimation in OAC, however, the two criteria are equivalent. The reason for this discrepancy is, again, that the discrete constellation transmitted in digital communications is a kind of prior information for the receiver, while an OAC receiver has no prior information at all.

More specifically, let us consider the message passing in Fig. 5. For ML estimation in OAC, we have shown that all the messages passed on the graph, including f(y | s_+) and f(y | s_+[i]), are Gaussian. Thus, the maximum-likelihood sequence s_+ also gives us the maximum-likelihood s_+[i], ∀i, after marginalization. As a result, MLSE-optimal and BCJR-optimal are equivalent, and the ML estimators in (13), (18), and (23) are identical. In contrast, when we perform ML estimation in digital communications, the prior information is that each transmitted symbol can only be one of the constellation points. As a consequence, the messages passed on the graph are Gaussian mixtures, and both f(y | s_+) and f(y | s_+[i]) are no longer Gaussian distributed. As a result, MLSE-optimal and BCJR-optimal are different criteria.

Computational complexity – Finally, we evaluate the computational complexity of the SP-ML estimator. With the analog message passing in Algorithm 1, the messages passed on the graph are simply the parameters of Gaussian distributions instead of continuous Gaussian PDFs. The computations involved in the analog message passing are 1) sums of 2M-dimensional vectors/matrices and 2) 2M-dimensional matrix inversions. Therefore, the computational complexity of the SP-ML estimator is Ω(LM³ log M). If we fix M as a constant, the SP-ML estimator significantly reduces the computational complexity of ML estimation from Ω(L² log L) to Ω(L).
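The final step of the SP-ML estimator, Eqs. (22)–(23), is a cheap deterministic reduction of the per-index moments delivered by Algorithm 1. Below is a minimal sketch of that reduction only; the moments themselves are assumed to come from the analog message passing, and the function name is illustrative.

```python
import numpy as np

def sp_ml_symbol(mu, Sigma):
    """Eqs. (22)-(23): given the 2M-dimensional real mean mu = [mu_r; mu_i]
    and 2M x 2M covariance Sigma of f(y | s[i]), return the SP-ML estimate
    of s_plus[i] and the real/imaginary variances of f(y | s_plus[i])."""
    M = mu.size // 2
    one = np.ones(M)
    s_plus = one @ mu[:M] + 1j * (one @ mu[M:])   # Eq. (23)
    var_r = one @ Sigma[:M, :M] @ one             # 1^T Sigma_rr 1
    var_i = one @ Sigma[M:, M:] @ one             # 1^T Sigma_ii 1
    return s_plus, var_r, var_i
```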
D. The Aligned-Sample Estimator

As stated at the beginning of this section, another benefit of the whitened matched filtering and sampling scheme is that it yields a sequence of samples wherein the indexes of the symbols from different devices are consistent. Specifically, let us consider the outputs of the M-th matched filter.

Letting k = M in (15), we have V(y_M[i]) = {s_1[i], s_2[i], ..., s_M[i]}, and

y_M[i] = Σ_{m=1}^{M} h'_m s_m[i] + z̃_M[i],   (24)

where z̃_M[i] ∼ CN(0, N_0/d_M). As can be seen, unlike the outputs of the other matched filters, the neighbor symbols of y_M[i] have the same index (that is, the symbol indexes are aligned within the integration interval of the M-th matched filter). Therefore, we can utilize the outputs of the M-th matched filter to devise an aligned-sample estimator.

Definition 3 (The aligned-sample estimator for misaligned OAC). Given the outputs of the whitened matched filtering and sampling scheme {y_k[i]}, the aligned-sample estimator estimates the desired sequence s_+ ∈ C^L symbol-by-symbol by

ŝ_+[i] = y_M[i].   (25)

Eq. (24) is an underdetermined equation, since we have one equation for M unknowns, and the estimator in (25) is our best prediction of s_+[i]. When there is no or mild channel-gain misalignment (i.e., h'_m → 1, ∀m), the estimator (25) is expected to perform very well.
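Given the whitened sample vector y ordered as in (17), the aligned-sample estimator is a simple slice: keep only the M-th filter's outputs. A sketch, with an illustrative function name:

```python
import numpy as np

def aligned_sample_estimate(y, M, L):
    """Aligned-sample estimator, Eq. (25): take y_M[i], i = 1, ..., L, from
    the whitened samples y = [y_1[1], ..., y_M[1], y_1[2], ..., y_{M-1}[L+1]].
    These samples mix only same-index symbols, so no equalization is needed."""
    y = np.asarray(y)
    return np.array([y[(i - 1) * M + (M - 1)] for i in range(1, L + 1)])
```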
V. SIMULATION RESULTS

This section evaluates the performance of federated edge learning with misaligned OAC considering two estimators at the receiver: the ML estimator and the aligned-sample estimator. In particular, for ML estimation we use our SP-ML estimator, since the ML estimators in (13) and (18) are computationally prohibitive. (We have performed additional simulations to validate that the three estimators in (13), (18), and (23) are equivalent, using a much shorter packet length L; the results are omitted here to conserve space.) We implement a federated edge learning system wherein the devices collaboratively train a convolutional neural network (CNN) to solve the CIFAR-10 classification task [31]. The CIFAR-10 dataset has a training set of 50,000 examples and a test set of 10,000 examples in 10 classes; each example is a 32 × 32 colour image. The non-i.i.d. training examples are assigned to the devices in the following manner: 1) we first let each device randomly sample a subset of the examples from the dataset; 2) the remaining examples are sorted by their labels and divided into equal-size shards [6]; each device is then assigned one shard.

The implemented CNN is a ShuffleNet V2 network [32] with d on the order of 10^6 parameters (corresponding to d/2 complex values). In each iteration, we assume M = 4 devices are active and participate in the training. Each device trains the global model locally for several epochs and then transmits its model update to the PS in packets of fixed length L. The packets from different transmitters overlap at the PS with time and channel-gain misalignments, and the PS employs the ML and aligned-sample estimators to estimate the arithmetic sum of the transmitted symbols, i.e., s_+ (and hence θ_+). The estimated arithmetic sum θ_+ is then used to update the global model, as per (2). All the source code is available online at [33].

The metric we use to assess the performance of an estimator is the test accuracy. Specifically, when operated with a given estimator, we train the global model for a fixed number of iterations and take the prediction accuracy of the learned model on the test set as the performance indicator of the estimator. An example is given in Fig. 6.

Figure 6. Test accuracy of the learned model over the course of training, with and without noise. There is no time or channel-gain misalignment. We use the ML estimator at the PS.

As shown in Fig. 6, we run the federated learning system with and without noise and plot the test accuracy over the course of training. There are no misalignments in this simulation, and the ML estimator is used at the PS. First, the dark curve corresponds to the noiseless case; the test accuracy it attains is the global-optimal test accuracy, since the MAC is ideal, i.e., there is no time misalignment, channel-gain misalignment, or noise in the MAC. The other two curves in Fig. 6 correspond to the learning performance when noise is present in the received signal. In particular, noise is added according to a given EsN0, i.e., the average received energy per symbol to noise power spectral density ratio, given by

EsN0 = E_i[ |Σ_{m=1}^{M} h'_m s_m[i]|² ] / N_0.   (26)

As shown, noise is detrimental to the test accuracy after convergence.

Figure 7. Test accuracy (after a fixed number of iterations) of the asynchronous OAC under various EsN0. There is no channel-gain misalignment in the simulation, and we use both the ML estimator and the aligned-sample estimator.
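In the simulations, noise is injected to meet a target EsN0 per (26). A sketch of this step, assuming clean superposed samples Σ_m h'_m s_m[i] are available (the per-filter d_k scaling of the whitened model is omitted for brevity, and the function name is illustrative):

```python
import numpy as np

def add_noise_at_esn0(clean, esn0_db, rng):
    """Add complex AWGN so that E[|sum_m h'_m s_m[i]|^2] / N0 equals the
    target EsN0, Eq. (26)."""
    es = np.mean(np.abs(clean) ** 2)          # average received symbol energy
    n0 = es / 10 ** (esn0_db / 10)            # implied noise spectral density
    z = np.sqrt(n0 / 2) * (rng.standard_normal(clean.shape)
                           + 1j * rng.standard_normal(clean.shape))
    return clean + z
```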
Compared with the noiseless case, the test accuracy degrades under noise, and the degradation grows as the EsN0 decreases.

In addition to noise, we next introduce time misalignment into the received signal. The received signal is given in (15), where the noise term z̃_k[i] ∼ CN(0, N_0/d_k), and we set the residual channel gains h'_m = 1, ∀m. Without loss of generality, the symbol duration is set to T = 1. As can be seen from (15), the time offsets {τ_m : m = 1, 2, ..., M} determine the noise power of the samples from the different matched filters (since d_k = τ_{k+1} − τ_k). In the simulation, the time offsets τ_m, ∀m, are set in the following manner: first, we fix a maximum time offset τ_M (and hence d_M); then, we generate the other time offsets uniformly in (0, τ_M).

The simulation results are presented in Fig. 7. When τ_M = 0, there is neither time nor channel-gain misalignment in the received signal. The ML estimator and the aligned-sample estimator are equivalent in this case, and hence they yield the same test accuracy. As we increase τ_M, the performance of both estimators deteriorates.

1) For the aligned-sample estimator, the performance deterioration is easy to understand, since the inputs to the estimator are the outputs of the M-th matched filter, y_M[i]. As a result, the performance of the aligned-sample estimator is governed by the maximum time offset τ_M – the larger the τ_M, the worse the performance. As shown in Fig. 7, the introduction of time misalignment results in an EsN0 penalty for the aligned-sample estimator. When τ_M = 0.5, the EsN0 penalty is 3 dB, because d_M is reduced by a factor of 2 (from 1 to 0.5). Likewise, the EsN0 penalty is 10 dB when τ_M = 0.9, since d_M is reduced by 10 times (from 1 to 0.1).

2) For different τ_M, the performance gain of the ML estimator over the aligned-sample estimator is negligible. The aligned-sample estimator utilizes only the outputs of the M-th matched filter. The ML estimator, on the other hand, utilizes the outputs of all matched filters and attempts to estimate the most likely s_+. It turns out that both estimators yield nearly the same performance when there is only time misalignment and noise.

Figure 8. Test accuracies of the ML and aligned-sample estimators under different degrees of phase misalignment. The maximum time offset is τ_M = 0.5, and the maximum phase misalignment is φ = 0 (no), π/2 (mild), 3π/4 (moderate), and π (severe), respectively.

It should be noted that the above result does not imply that the samples of the matched filters other than the M-th are useless, because the ML estimator cannot achieve the same performance as the aligned-sample estimator using only the outputs of the M-th matched filter: recall from (25) that the outputs of the M-th matched filter are underdetermined equations. If we perform ML estimation based only on the samples in (25), the estimation error can be arbitrarily large, as s_+[i] can take any value.

Remark (error propagation). In the misaligned OAC, ML estimation boils down to MUD and faces an infinitely large estimation space. It is then very susceptible to noise and suffers from error propagation. Take the SP-ML estimator, for instance. In the forward message passing, the successful estimation of a likelihood function (of a multivariate variable) hinges on the accurate estimation of the likelihood functions to its left. When a sample is contaminated by noise, the mean of the likelihood function deviates from the true value of the noiseless sample.
This estimation error is propagated along the tree all the way to the rightmost leaf, because there are no known messages (i.e., prior information) in between to alleviate or correct the error. This can be one cause of the results in Fig. 7.
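For the aligned-sample estimator, the EsN0 penalties quoted in the Fig. 7 discussion follow directly from the noise variance N_0/d_M of the samples in (24), with d_M = T − τ_M. A back-of-the-envelope check, assuming T = 1:

```latex
\mathrm{penalty} = 10\log_{10}\frac{T}{T-\tau_M}\;\text{dB}:\quad
\tau_M = 0.5 \Rightarrow 10\log_{10} 2 \approx 3\;\text{dB},\quad
\tau_M = 0.9 \Rightarrow 10\log_{10} 10 = 10\;\text{dB}.
```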
In the third simulation, we further introduce channel-gain misalignment into the received signal (15). For each device, the residual channel gain is h'_m = |h'_m| e^{jφ_m}. We set |h'_m| = 1, ∀m, in the simulation and focus on the impact of the phase offsets φ_m only. In particular, we assume {φ_m : m = 1, 2, ..., M} are uniformly distributed in (0, φ), where φ is the maximum phase offset (i.e., φ_m ∼ U(0, φ)). It is worth noting that φ_m can follow any distribution in general.

Fig. 8 presents the test accuracy of the ML and aligned-sample estimators versus EsN0 (in dB), wherein the maximum time offset τ_M is fixed to 0.5 and the maximum phase offset is φ = 0 (no phase misalignment), π/2 (mild), 3π/4 (moderate), and π (severe), respectively.
Remark. A caveat here is that φ is the maximum phase offset – the phase offsets of all the devices are uniformly distributed in [0, φ]. If we look at the phase misalignment between any two devices, however, the average pairwise phase misalignment is only about half of φ. That is why we classify π/2 as mild phase misalignment: the average pairwise phase misalignment is then only about π/4.

Figure 9. Test accuracies of the ML and aligned-sample estimators under different degrees of phase misalignment. The maximum time offset is τ_M = 0.9, and the maximum phase misalignment is φ = 0 (no), π/2 (mild), 3π/4 (moderate), and π (severe), respectively.

We have the following observations from Fig. 8:

1) When the maximum phase misalignment is φ = 0, the performance curves of the ML and aligned-sample estimators are copied from Fig. 7, and they coincide with each other.

2) When there is mild phase misalignment (φ = π/2), the aligned-sample estimator suffers two penalties: i) a small EsN0 penalty of a few dB, i.e., we need a slightly higher EsN0 to achieve the same test accuracy; ii) a small test-accuracy loss, i.e., the test accuracy after convergence is slightly lower than in the phase-aligned case. The ML estimator, on the other hand, suffers a large EsN0 penalty. The reason is that phase misalignment intensifies the error/noise propagation in ML estimation, which we refer to as noise enhancement. As a result, ML estimation does not work in the low-EsN0 regime when there is phase misalignment. On the bright side, the ML estimator performs very well in the high-EsN0 regime: it suffers no test-accuracy loss – the test accuracy after convergence is the same as in the phase-aligned case.

3) When we further increase the maximum phase misalignment φ, the aligned-sample estimator suffers larger EsN0 and test-accuracy penalties. In the case of moderate phase misalignment (φ = 3π/4), the test-accuracy penalty is larger still. In the case of severe phase misalignment (φ = π), the learning diverges with the aligned-sample estimator. In contrast, the ML estimator is robust to moderate and severe phase misalignment in the high-EsN0 regime – there is no test-accuracy loss and only a small EsN0 penalty.

Fig. 8 studies the impact of phase misalignment under a fixed maximum time offset τ_M = 0.5. Next, we consider a larger time offset τ_M = 0.9 and repeat the simulations of Fig. 8. The simulation results are presented in Fig. 9. For the aligned-sample estimator, the time offset incurs only an EsN0 penalty. Thus, each performance curve of the aligned-sample estimator in Fig. 9 is simply a right-shift of the corresponding curve in Fig. 8 by 7 dB.

For the ML estimator, an interesting observation is that ML estimation benefits from larger time misalignment when there is phase misalignment. For example, with mild phase misalignment (φ = π/2), there is a noticeable EsN0 gain when the maximum time offset τ_M is increased from 0.5 to 0.9, as shown in Fig. 8 and Fig. 9. In contrast, when there is no phase misalignment, ML estimation suffers from larger time misalignment, as shown in Fig. 7.
To conclude this section, we summarize the main simulation results as follows.
1) When there is no phase misalignment, the ML and aligned-sample estimators are on an equal footing as far as the learning performance is concerned.
2) When there is mild or moderate phase misalignment, the aligned-sample estimator outperforms the ML estimator in the low-EsN0 regime, but is worse than the ML estimator in the high-EsN0 regime.
3) When there is severe phase misalignment, the aligned-sample estimator leads to divergence of learning, but ML estimation still works in the high-EsN0 regime.

VI. CONCLUSION
As a joint computation-and-communication technique, over-the-air computation (OAC) exploits the property of the multiple-access channel (MAC) that its output is the arithmetic sum of the inputs. OAC is an efficient scheme to speed up the uplink aggregation of models from the edge devices in federated edge learning.
This paper filled the research gap of misaligned OAC by devising two estimators, a sum-product maximum likelihood (SP-ML) estimator and an aligned-sample estimator, to estimate the arithmetic sum of the symbols from different devices in the face of channel-gain and time misalignments.
The underpinning of the proposed estimators is an oversampled matched filtering and sampling scheme that yields: a) whitened samples with alleviated inter-symbol and inter-user interference; b) a subsequence of samples wherein the indexes of the transmitted symbols from different devices are aligned.
The ML estimator – The whitened samples in a) allow us to construct a factor graph with a simple structure to represent the composition of the samples, whereby an SP-ML estimator can be devised to compute the ML estimate of the arithmetic sum via an analog message-passing process. Compared with the conventional ML estimator, which is computationally prohibitive, the complexity of the SP-ML estimator grows only linearly with the packet length, and hence it is computationally efficient.
In OAC systems, the symbols transmitted from the edge devices are continuous values rather than points from discrete constellations. Therefore, we have no prior information about the transmitted symbols, and ML estimation is the only option. In this context, the arithmetic-sum estimation boils down to multi-user decoding/estimation (MUD), and the estimation space is infinitely large. Two problems with ML estimation are error propagation and noise enhancement. Specifically, the estimation error introduced by noise in one sample can propagate to other samples, causing larger and larger deviations from the true values for all the samples in between. The error propagation is further intensified by phase misalignment, since the error/noise can be amplified in the course of propagation.
As a result, ML estimation does not perform well in the low-EsN0 regime, especially when there is phase misalignment. To address this problem, a possible solution is to insert pilots into the transmitted symbols to cut off the error propagation.
The aligned-sample estimator – The subsequence of "aligned" samples in b), on the other hand, allows an aligned-sample estimator to be used for misaligned OAC. The upside of the aligned-sample estimator is that it does not suffer from the error-propagation and noise-enhancement issues, and hence is a good alternative to the ML estimator in the low-EsN0 regime. The downsides, however, are that it suffers from both phase misalignment and time misalignment: phase misalignment causes a test-accuracy loss, and time misalignment causes an EsN0 penalty.
The computational complexities of both the ML and aligned-sample estimators grow linearly with the packet length. The aligned-sample estimator is preferred in the low-EsN0 regime, and the ML estimator is preferred in the high-EsN0 regime.
Time-domain realization versus frequency-domain realization – This paper considered a time-domain realization of OAC. OAC can also be realized in the frequency domain via OFDM. An interesting direction in which to extend the current work is to compare the two realizations and their abilities to combat the misalignments. We remark that our study in this paper also sheds light on the frequency-domain realization of OAC. A brief comparison between the two realizations is given below to provide some operational insights.
Time-domain versus frequency-domain realization of wireless communication systems is a long-standing debate. Where misalignments are concerned, the time-domain realization is sensitive to the time offsets among edge devices, while the frequency-domain realization is sensitive to the carrier frequency offsets (CFOs) among edge devices.
1) Time misalignment. With the time-domain realization, time misalignment leads to an EsN0 penalty on the learning performance, as verified in Section V. OFDM, on the other hand, deliberately introduces redundancy known as the cyclic prefix (CP) and transforms the time offset τ_m of each device into extra phase offsets e^{j2πℓτ_m/(LT)}, ℓ = 1, 2, ..., L, on the L subcarriers in the frequency domain. If we look at one subcarrier, it is equivalent to a synchronous time-domain realization with phase misalignments: the frequency-domain samples take the same form as (24) with h'_m = e^{j2πℓτ_m/(LT)} and d_M = T. It is then an underdetermined equation, and we can use the aligned-sample estimator in Definition 3 to estimate the arithmetic sum.
We emphasize that the phase misalignment can be severe for large τ_M (maximum time offset) and ℓ (subcarrier index). For example, when τ_M = T and ℓ = L, the phase misalignment is up to 2π. When there is no CFO, the performance of the OFDM system for misaligned OAC can then be predicted from the performance of the aligned-sample estimator in Fig. 8 after a left shift in EsN0 (i.e., setting τ_M = 0).
To summarize, OFDM systems introduce redundancy to transform time misalignment into phase misalignment; in effect, they trade the EsN0 loss caused by time misalignment for a test-accuracy loss caused by phase misalignment. It is worth noting that we can also insert redundancy, e.g., pilots, into the time-domain realization to improve the performance of ML estimation.
2) CFO. As shown in Appendix A, for a time-domain realization, residual CFO in the overlapped signal simply introduces additional channel-gain misalignments among devices. For a frequency-domain realization, however, CFO introduces inter-carrier interference (ICI), a dual problem of the inter-symbol and inter-user interference in the time-domain realization of OAC. As a result, we have to devise an ML estimator to combat the ICI, and ML estimation in OFDM with residual CFO falls within the same scope as ML estimation in the time-domain realization with time misalignment. In this context, the analysis in this paper of the properties of ML estimation for misaligned OAC still holds, and the SP-ML estimator devised in Section IV can also be used in OFDM systems to perform ML estimation.
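As an illustration of the time-to-phase transformation described above, the following sketch computes the per-subcarrier phase offsets that a set of time offsets would induce. The values of L, T, and the offsets are illustrative choices of ours, not the paper's simulation settings.

```python
import numpy as np

# Sketch of the OFDM time-to-phase transformation: a time offset tau_m
# becomes a per-subcarrier phase offset exp(j*2*pi*l*tau_m/(L*T)).
L = 64                                   # number of subcarriers
T = 1.0                                  # symbol duration (normalized)
tau = np.array([0.1, 0.4, 0.8]) * T      # time offsets of three devices

l = np.arange(1, L + 1)                  # subcarrier indices 1..L
phase = 2 * np.pi * np.outer(tau, l) / (L * T)   # shape (num_devices, L)

# On low subcarriers the induced phase offsets are mild; on the highest
# subcarrier (l = L), a device with tau_m = T would see a phase offset
# of 2*pi, i.e., the misalignment can span the full circle.
print(phase[:, 0])     # mildest offsets (first subcarrier)
print(phase[:, -1])    # most severe offsets (last subcarrier)
```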
APPENDIX A
In this appendix, we generalize the system model in Section II by considering fast-fading channels \tilde{h}_m(t) and residual CFO in the received signal r(t).
At each transmitter, to compensate for the fast channel fading and the CFO between the transmitter and the receiver, α_m in (3) is designed as α_m(t) = e^{−j\bar{ε}_m t} / \bar{h}_m(t), where \bar{ε}_m is the CFO estimated at the m-th device.
The received signal r(t) is then given by

r(t) = Σ_{m=1}^{M} \tilde{h}_m(t) e^{j\tilde{ε}_m t} x_m(t − τ_m) + z(t). (27)

Unlike (4), each \tilde{h}_m(t) is now a fast-fading channel and \tilde{ε}_m is the CFO between the m-th device and the PS. Substituting (3) into (27) gives

r(t) = Σ_{m=1}^{M} \tilde{h}_m(t) e^{j\tilde{ε}_m t} α_m(t) Σ_{ℓ=1}^{L} s_m[ℓ] p(t − τ_m − ℓT) + z(t)
     = Σ_{ℓ=1}^{L} Σ_{m=1}^{M} h'_m(t) e^{jε'_m t} s_m[ℓ] p(t − τ_m − ℓT) + z(t), (28)

where h'_m(t) = \tilde{h}_m(t)/\bar{h}_m(t) is the residual channel-fading coefficient and ε'_m = \tilde{ε}_m − \bar{ε}_m is the residual CFO between the m-th device and the PS. As can be seen, CFO introduces additional channel-gain misalignments among devices.
With the revised signal r(t), the discrete samples can be written in the same form as (8) with different coefficients c_{m,k}[i] and c'_{m,k}[i], giving

c_{m,k}[i] = (1/T) ∫_{(i−1)T+τ_k}^{(i−1_{m>k})T+τ_m} h'_m(ζ) e^{jε'_m ζ} dζ, (29)

c'_{m,k}[i] = (1/T) ∫_{(i−1_{m>k})T+τ_m}^{iT+τ_k} h'_m(ζ) e^{jε'_m ζ} dζ. (30)

On the other hand, for our whitened matched filtering and sampling scheme, the samples y_k[i] in (15) can be revised to

y_k[i] = Σ_{m=1}^{M} g_{m,k}[i] s_m[i − 1_{m>k}] + \tilde{z}_k[i], (31)

where

g_{m,k}[i] = (h'_m / d_k) ∫_{(i−1)T+τ_k}^{(i−1)T+τ_{k+1}} e^{jε'_m ζ} dζ = (h'_m / d_k) · ((e^{jε'_m d_k} − 1)/(jε'_m)) · e^{jε'_m[(i−1)T+τ_k]}.

As a result, if the channel is fast fading and there is residual CFO in the received signal, we can simply revise the coefficient matrix D by replacing its nonzero entries with the corresponding coefficients g_{m,k}[i]; the structure of D is otherwise unchanged. (32)
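A minimal sketch of the coefficient g_{m,k}[i] in (31), treating the residual gain h'_m as constant over the integration window, as in the closed form above. The function name and test values are ours, chosen for illustration.

```python
import numpy as np

def g_coeff(h_m, eps_m, d_k, tau_k, i, T=1.0):
    """g_{m,k}[i] = h'_m/d_k * (e^{j*eps*d_k} - 1)/(j*eps) * e^{j*eps*[(i-1)T + tau_k]}."""
    if abs(eps_m) < 1e-12:
        # CFO-free limit: the ramp integral equals d_k, so g reduces to h'_m.
        return h_m + 0j
    ramp = (np.exp(1j * eps_m * d_k) - 1.0) / (1j * eps_m)
    return (h_m / d_k) * ramp * np.exp(1j * eps_m * ((i - 1) * T + tau_k))

# Small residual CFO perturbs the effective gain away from h'_m:
print(g_coeff(h_m=1.0, eps_m=0.05, d_k=0.3, tau_k=0.1, i=2))
print(g_coeff(h_m=1.0, eps_m=0.0,  d_k=0.3, tau_k=0.1, i=2))  # exactly h'_m
```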
APPENDIX B
This appendix proves that the marginal likelihood function f(y | s[i]) is Gaussian and derives its moment parameters via a sum-product process.
To begin with, we point out that a multivariate Gaussian distribution can be parameterized by two sets of parameters [34], [35]: the moment parameters (μ, Σ) and the canonical parameters (η, Λ). The two sets of parameters can be transformed into one another, and they are useful in different circumstances, as detailed below.
1. The moment parameters (μ, Σ). For a multivariate real Gaussian random variable w of dimension M, the moment parameters are defined as

μ = E[w],  Σ = E[(w − μ)(w − μ)^⊤].

The moment form of the Gaussian distribution is given by

N(w; μ, Σ) = (1/√((2π)^M |Σ|)) exp{ −(1/2)(w − μ)^⊤ Σ† (w − μ) }.
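As a quick numerical companion, the following sketch evaluates the moment-form density above, using a pseudo-inverse for Σ† to match the dagger notation; the test values are illustrative.

```python
import numpy as np

def gaussian_moment_pdf(w, mu, Sigma):
    """Evaluate N(w; mu, Sigma) in moment form, with a pseudo-inverse for Sigma."""
    M = len(mu)
    diff = w - mu
    quad = diff @ np.linalg.pinv(Sigma) @ diff
    norm = np.sqrt((2 * np.pi) ** M * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

mu = np.zeros(2)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
print(gaussian_moment_pdf(np.array([0.3, -0.1]), mu, Sigma))
```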
The moment parameters are useful when we marginalize a multivariate Gaussian [35]. Let w ∼ N(w; μ, Σ) be a multivariate Gaussian random variable of dimension M with moment parameters (μ, Σ). Let us partition w = [w_1, w_2]^⊤, where w_1 and w_2 are multivariate Gaussians of dimension κ and M − κ, respectively. The moment parameters can be partitioned accordingly as

μ = [μ_1; μ_2],  Σ = [Σ_11, Σ_12; Σ_21, Σ_22].

If we marginalize out w_2 from w, the marginal f(w_1) is still a Gaussian distribution, giving

f(w_1) = ∫_{w_2} N([w_1; w_2]; [μ_1; μ_2], [Σ_11, Σ_12; Σ_21, Σ_22]) dw_2 ∝ N(w_1; μ_1, Σ_11). (33)

2. The canonical parameters (η, Λ). For a Gaussian random variable w of dimension M, the canonical parameters are defined as

η = Σ† μ,  Λ = Σ†.

The canonical form of the Gaussian distribution is given by

N(w; η, Λ) = exp{ −(1/2) w^⊤ Λ w + w^⊤ η + ρ },

where ρ is the constant ρ = −(1/2)(M ln 2π − ln|Λ| + η^⊤ Λ† η).
The canonical parameters are useful when we multiply two or more Gaussians [36]. Let {w_k : k = 1, 2, ..., K}, w_k ∼ N(w; η_k, Λ_k), be a set of multivariate real Gaussian random variables of dimension M. Then, their product is still Gaussian, with the new canonical parameters being the sums of the canonical parameters of the original K Gaussians:

∏_{k=1}^{K} N(w_k; η_k, Λ_k) ∝ N(w; Σ_{k=1}^{K} η_k, Σ_{k=1}^{K} Λ_k) = exp{ −(1/2) w^⊤ (Σ_k Λ_k) w + w^⊤ (Σ_k η_k) + Σ_k ρ_k }. (34)

Let us now set out to show that the marginal likelihood function f(y | s[i]) is an M-dimensional complex Gaussian distribution. Since the factor graph in Fig. 5 has a tree structure, we only need to pass the messages from left to right (forward message passing) and then from right to left (backward message passing). Each message needs to be computed once and only once, after which the exact marginal posterior distribution converges.
Forward message passing – We first investigate how the messages are passed from left to right in Fig. 5. Without loss of generality, we focus on the message passing from one variable W_{k,i} to the variable W_{k+1,i} on its right.
Notice that W_{k,i} = V(y_k[i]) = {s_1[i], ..., s_k[i], s_{k+1}[i−1], s_{k+2}[i−1], ..., s_M[i−1]} and W_{k+1,i} = V(y_{k+1}[i]) = {s_1[i], ..., s_k[i], s_{k+1}[i], s_{k+2}[i−1], ..., s_M[i−1]}. Thus, the only symbol that differs between W_{k,i} and W_{k+1,i} is the (k+1)-th symbol. We consider each complex random variable s_k[i] as a pair of real random variables: the real part and the imaginary part. Then, each W_{k,i} can be viewed as a 2M-dimensional real random variable.

[Figure 10. The forward message passing from W_{k,i} to W_{k+1,i} (in blue) and the backward message passing from W_{k+1,i} to W_{k,i} (in green).]
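The two Gaussian facts above, marginalization in moment form (33) and multiplication in canonical form (34), can be sanity-checked numerically. A minimal sketch with illustrative dimensions (our variable names, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_gaussian(M):
    """Draw a random mean and a positive-definite covariance of dimension M."""
    A = rng.normal(size=(M, M))
    return rng.normal(size=M), A @ A.T + M * np.eye(M)

M = 4
mu1, S1 = random_gaussian(M)
mu2, S2 = random_gaussian(M)

# Product rule (34): canonical parameters simply add.
L1, L2 = np.linalg.inv(S1), np.linalg.inv(S2)
eta = L1 @ mu1 + L2 @ mu2
Lam = L1 + L2
mu_prod = np.linalg.solve(Lam, eta)      # mean of the product, in moment form

# Marginalization (33): keep the corresponding block of (mu, Sigma).
kappa = 2
mu_marg, S_marg = mu1[:kappa], S1[:kappa, :kappa]

print(mu_prod)
print(mu_marg, S_marg, sep="\n")
```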
To simplify notation, we denote the 2M-dimensional real variates corresponding to W_{k,i} and W_{k+1,i}, respectively, by

w_{k,i} = (b_1^r, ..., b_k^r, b_{k+1}^r, b_{k+2}^r, ..., b_M^r, b_1^i, ..., b_k^i, b_{k+1}^i, b_{k+2}^i, ..., b_M^i),
w_{k+1,i} = (b_1^r, ..., b_k^r, c_{k+1}^r, b_{k+2}^r, ..., b_M^r, b_1^i, ..., b_k^i, c_{k+1}^i, b_{k+2}^i, ..., b_M^i),

as shown in Fig. 10.
On the left half of Fig. 10, there are three edges centered around the equality function "=" (marked in blue). Due to the equality constraint, these edges are associated with the same high-dimensional variable w_{k,i}. In the forward message passing, there are three messages to be computed.
1) The message passed from the bottom, denoted by f_b(w_{k,i}). This message carries the information about w_{k,i} contained in the sample y_{k,i}. As per (15),

y_{k,i} = Σ_{m=1}^{M} (h_m^r + j h_m^i)(b_m^r + j b_m^i) + \tilde{z}_{k,i}^r + j\tilde{z}_{k,i}^i
        = Σ_{m=1}^{M} (h_m^r b_m^r − h_m^i b_m^i) + \tilde{z}_{k,i}^r + j[ Σ_{m=1}^{M} (h_m^r b_m^i + h_m^i b_m^r) + \tilde{z}_{k,i}^i ],

where \tilde{z}_{k,i}^r, \tilde{z}_{k,i}^i ∼ N(0, N_0/(2d_k)). Thus, the likelihood function f(y_{k,i} | w_{k,i}) is Gaussian, giving

f(y_{k,i} | w_{k,i}) ∝ exp{ −(d_k/N_0) [y_{k,i}^r − Σ_m (h_m^r b_m^r − h_m^i b_m^i)]² } × exp{ −(d_k/N_0) [y_{k,i}^i − Σ_m (h_m^r b_m^i + h_m^i b_m^r)]² }.

When we pass the information bottom-up, y_{k,i} is our observation (hence a constant) and w_{k,i} is the variable. Therefore, f_b(w_{k,i}) = f(y_{k,i} | w_{k,i}). After some manipulations, we can write f_b(w_{k,i}) as a 2M-dimensional Gaussian distribution, giving

f_b(w_{k,i}) ∝ N(w_{k,i}; η_b, Λ_b), (35)

where η_b and Λ_b are defined as

η_b = (2d_k/N_0) [β_1; β_2] [y_{k,i}^r; y_{k,i}^i],  Λ_b = (2d_k/N_0) [β_1 β_1^⊤, β_1 β_2^⊤; β_2 β_1^⊤, β_2 β_2^⊤], (36)

and the M×2 matrices β_1 and β_2 are composed of the channel coefficients:

β_1 = [h_1^r, h_1^i; h_2^r, h_2^i; ...; h_M^r, h_M^i],  β_2 = [−h_1^i, h_1^r; −h_2^i, h_2^r; ...; −h_M^i, h_M^r].

In (36), we have assumed that the dimensionality of w_{k,i} is 2M, that is, y_{k,i} = y_k[i] is related to M complex variables. However, this is valid only when the number of neighbor symbols of y_k[i] is M. As shown in (20), |V(y_k[i])| = M only when 1 < i ≤ L. Thus, we can compute η_b and Λ_b by (36) only when 1 < i ≤ L.
For the boundary samples (i = 1 and i = L + 1), whose neighbor symbols number fewer than M, we further multiply the parameters η_b and Λ_b in (36) by an indicator vector γ_k and an indicator matrix Γ_k, respectively, to ensure that f_b(w_{k,i}) does not contain information about the symbols that do not belong to V(y_k[i]). The general forms of η_b and Λ_b are

η_b = (2d_k/N_0) [β_1; β_2] [y_{k,i}^r; y_{k,i}^i] ∘ γ_k, (37)
Λ_b = (2d_k/N_0) [β_1 β_1^⊤, β_1 β_2^⊤; β_2 β_1^⊤, β_2 β_2^⊤] ∘ Γ_k, (38)

where ∘ denotes elementwise multiplication. The indicator vector γ_k and the indicator matrix Γ_k are defined as follows.
First, for the first M samples (i.e., i = 1), we have |V(y_k[i])| = k from (20). Thus, we define

γ_k = [1_{k×1}; 0_{(M−k)×1}; 1_{k×1}; 0_{(M−k)×1}],

Γ_k = [1_{k×k}, 0_{k×(M−k)}, 1_{k×k}, 0_{k×(M−k)};
       0_{(M−k)×k}, 0_{(M−k)×(M−k)}, 0_{(M−k)×k}, 0_{(M−k)×(M−k)};
       1_{k×k}, 0_{k×(M−k)}, 1_{k×k}, 0_{k×(M−k)};
       0_{(M−k)×k}, 0_{(M−k)×(M−k)}, 0_{(M−k)×k}, 0_{(M−k)×(M−k)}],

where 1 and 0 are the all-ones and all-zeros matrices, with subscripts giving their dimensionalities.
Second, for the last M samples (i.e., i = L + 1), we have |V(y_k[i])| = M − k from (20).
Thus, we define

γ_k = [0_{k×1}; 1_{(M−k)×1}; 0_{k×1}; 1_{(M−k)×1}],

Γ_k = [0_{k×k}, 0_{k×(M−k)}, 0_{k×k}, 0_{k×(M−k)};
       0_{(M−k)×k}, 1_{(M−k)×(M−k)}, 0_{(M−k)×k}, 1_{(M−k)×(M−k)};
       0_{k×k}, 0_{k×(M−k)}, 0_{k×k}, 0_{k×(M−k)};
       0_{(M−k)×k}, 1_{(M−k)×(M−k)}, 0_{(M−k)×k}, 1_{(M−k)×(M−k)}].

Finally, for all other samples (1 < i ≤ L), we simply set γ_k = 1_{2M×1} and Γ_k = 1_{2M×2M}. This is consistent with (36).
2) The second message, denoted by f_ℓ(w_{k,i}) in Fig. 10, is the message passed from w_{k−1,i} on the left. This message is obtained in the same way as f_ℓ(w_{k+1,i}), which we will derive at the end of the forward message passing. For now, let us assume it is Gaussian and denote it by

f_ℓ(w_{k,i}) ∝ N(w_{k,i}; η_ℓ, Λ_ℓ). (39)
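To make (36)-(38) concrete, the sketch below builds the canonical parameters of the bottom message f_b. The variable names are ours; note that Γ_k can be formed as the outer product of γ_k with itself, which reproduces the indicator-block pattern defined above.

```python
import numpy as np

rng = np.random.default_rng(2)
M, N0, d_k = 3, 1.0, 0.4
h = rng.normal(size=M) + 1j * rng.normal(size=M)   # residual gains (toy values)
y = 0.7 + 0.2j                                     # one whitened sample y_{k,i}

# beta_1, beta_2 stack the real/imaginary parts of the channel coefficients.
beta1 = np.column_stack([h.real, h.imag])          # M x 2
beta2 = np.column_stack([-h.imag, h.real])         # M x 2
B = np.vstack([beta1, beta2])                      # 2M x 2

eta_b = (2 * d_k / N0) * B @ np.array([y.real, y.imag])   # (36), vector part
Lam_b = (2 * d_k / N0) * B @ B.T                          # (36), matrix part

# Boundary sample i = 1: only the first k symbols belong to V(y_k[i]),
# so mask out the rest via gamma_k and Gamma_k = gamma_k gamma_k^T.
k = 2
gamma = np.r_[np.ones(k), np.zeros(M - k), np.ones(k), np.zeros(M - k)]
eta_b_boundary = eta_b * gamma                     # (37)
Lam_b_boundary = Lam_b * np.outer(gamma, gamma)    # (38)
print(eta_b.shape, Lam_b.shape)                    # (2M,), (2M, 2M)
```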
3) As per the sum-product rule, the message out of a local function along an edge is the product of all incoming messages to this local function along all the other edges. Thus, the message out of the equality function "=", denoted by f_r(w_{k,i}) in Fig. 10, can be obtained by

f_r(w_{k,i}) = f_b(w_{k,i}) f_ℓ(w_{k,i}). (40)

This is the "product" step of the sum-product algorithm. From (34), we know f_r(w_{k,i}) is still a Gaussian distribution, and

f_r(w_{k,i}) ∝ N(w_{k,i}; η_r, Λ_r), (41)

where η_r = η_b + η_ℓ and Λ_r = Λ_b + Λ_ℓ. We emphasize that this message aggregates all the known information about w_{k,i} from the left side of the graph.
The next step is to pass the message f_r(w_{k,i}) through the compatibility function δ. Notice that the compatibility function connects two different variables: on the LHS, the variable associated with the edge is w_{k,i}; on the RHS, the variable associated with the edge is w_{k+1,i}. Therefore, we have to integrate f_r(w_{k,i}) over all variates that are in w_{k,i} but not in w_{k+1,i}. This is the "sum" step of the sum-product algorithm. Notice that the common variates of w_{k,i} and w_{k+1,i} are

w_∩ = (b_1^r, ..., b_k^r, b_{k+2}^r, ..., b_M^r, b_1^i, ..., b_k^i, b_{k+2}^i, ..., b_M^i),

and the two differing variates are b_{k+1}^r and b_{k+1}^i; in w_{k+1,i}, these two variates are c_{k+1}^r and c_{k+1}^i.
Let us integrate f_r(w_{k,i}) over b_{k+1}^r and b_{k+1}^i, giving

f(w_∩) = ∫_{b_{k+1}^r} ∫_{b_{k+1}^i} f_r(w_{k,i}) db_{k+1}^r db_{k+1}^i.

As per (33), f(w_∩) is Gaussian distributed. In particular, if we write f_r(w_{k,i}) and f(w_∩) in moment form as

f_r(w_{k,i}) ∝ N(w_{k,i}; μ_r, Σ_r),  f(w_∩) ∝ N(w_∩; μ_∩, Σ_∩),

then μ_∩ can be obtained by deleting the (k+1)-th and (k+1+M)-th rows of μ_r, and Σ_∩ can be obtained by deleting the (k+1)-th and (k+1+M)-th rows and columns of Σ_r.
However, w_∩ is a (2M−2)-dimensional variable. To obtain f_ℓ(w_{k+1,i}), we have to expand the dimensionality of w_∩ by adding c_{k+1}^r and c_{k+1}^i in the (k+1)-th and (k+1+M)-th positions. After expansion, f_ℓ(w_{k+1,i}) is still multivariate Gaussian:

f_ℓ(w_{k+1,i}) ∝ N(w_{k+1,i}; μ_ℓ, Σ_ℓ), (42)

where μ_ℓ can be obtained by adding two zeros to μ_∩, and Σ_ℓ can be obtained by adding two all-zero rows and two all-zero columns to Σ_∩.
To summarize, we have shown that all the messages involved in the forward message passing are 2M-dimensional jointly Gaussian and can be parameterized by (35), (39), (41), and (42), respectively.
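A compact sketch of one forward step, combining the "product" operations (40)-(41) with the "sum" (marginalize-then-expand) operation (42). The small regularizer is our addition to keep the toy example invertible; it is not part of the paper's derivation, which works with pseudo-inverses.

```python
import numpy as np

def forward_step(eta_b, Lam_b, eta_l, Lam_l, k, M, reg=1e-9):
    """One forward message pass from w_{k,i} to w_{k+1,i} (0-based k)."""
    # Product step (40)-(41): canonical parameters add.
    eta_r = eta_b + eta_l
    Lam_r = Lam_b + Lam_l
    # Convert to moment form to marginalize; Lam_r may be rank-deficient
    # at the boundaries, hence the regularizer in this sketch.
    Sigma_r = np.linalg.inv(Lam_r + reg * np.eye(2 * M))
    mu_r = Sigma_r @ eta_r
    # Sum step: delete the rows/columns of b^r_{k+1} and b^i_{k+1} ...
    drop = (k, k + M)                    # 0-based positions of the two variates
    keep = [j for j in range(2 * M) if j not in drop]
    mu_cap = mu_r[keep]
    Sigma_cap = Sigma_r[np.ix_(keep, keep)]
    # ... then expand back with zeros for the new variates c^r_{k+1}, c^i_{k+1} (42).
    mu_l = np.zeros(2 * M)
    Sigma_l = np.zeros((2 * M, 2 * M))
    mu_l[keep] = mu_cap
    Sigma_l[np.ix_(keep, keep)] = Sigma_cap
    return mu_l, Sigma_l
```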
Backward message passing – Our tree structure is symmetric, so the backward message passing mirrors the forward message passing. As shown in Fig. 10, to pass the messages from w_{k+1,i} to w_{k,i}, we first compute the two incoming messages f_b(w_{k+1,i}) and f'_r(w_{k+1,i}), where f_b(w_{k+1,i}) is the same as in the forward message passing and f'_r(w_{k+1,i}) is the message passed from w_{k+2,i} on the right.
Then, f'_ℓ(w_{k+1,i}) and f'_r(w_{k,i}) are computed from the "product" and "sum" steps, respectively, by

f'_ℓ(w_{k+1,i}) = f_b(w_{k+1,i}) f'_r(w_{k+1,i}), (43)
f'_r(w_{k,i}) = ∫_{c_{k+1}^r} ∫_{c_{k+1}^i} f'_ℓ(w_{k+1,i}) dc_{k+1}^r dc_{k+1}^i. (44)

[Figure 11. The marginalization process in the sum-product algorithm.]

Marginalization – After one forward message passing from left to right and one backward message passing from right to left, the marginal likelihood function of each variable w_{k,i} converges and can be computed by

f(w_{k,i} | y_1, y_2, ..., y_M) = f_b(w_{k,i}) f_ℓ(w_{k,i}) f'_r(w_{k,i}), (45)

as illustrated in Fig. 11. Therefore, f(w_{k,i} | y_1, y_2, ..., y_M) is a 2M-dimensional real Gaussian distribution. In particular, if we write the three messages on the RHS of (45) in canonical form, then the canonical parameters of f(w_{k,i} | y_1, y_2, ..., y_M) are the sums of theirs.
Recall that w_{k,i} = (b_1^r, ..., b_M^r, b_1^i, ..., b_M^i) is a 2M-dimensional real random variable, where b_m^r and b_m^i are the real and imaginary parts of the m-th complex element of W_{k,i} = V(y_k[i]) = (s_1[i], s_2[i], ..., s_k[i], s_{k+1}[i−1], s_{k+2}[i−1], ..., s_M[i−1]); thus, f(y | W_{k,i}) is an M-dimensional complex Gaussian distribution.
Letting k = M, we have W_{M,i} = (s_1[i], s_2[i], ..., s_M[i]) = s[i]. This means

f(y | s[i]) = f(y | W_{M,i}), (46)

which is an M-dimensional complex Gaussian distribution whose mean and covariance can be computed from (45).

REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[2] D. Gündüz, P. de Kerret, N. D. Sidiropoulos, D. Gesbert, C. R. Murthy, and M. van der Schaar, "Machine learning in the air,"
IEEE J. Sel. Areas Commun., vol. 37, no. 10, pp. 2184–2199, 2019.
[3] Y. Shao, A. Rezaee, S. C. Liew, and V. Chan, "Significant sampling for shortest path routing: a deep reinforcement learning solution," IEEE J. Sel. Areas Commun., vol. 38, no. 10, pp. 2234–2248, 2020.
[4] D. Gunduz, D. B. Kurka, M. Jankowski, M. M. Amiri, E. Ozfatura, and S. Sreekumar, "Communicate to learn at the edge," IEEE Commun. Mag., vol. 58, no. 12, pp. 14–19, 2020.
[5] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
[6] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in AISTATS. PMLR, 2017, pp. 1273–1282.
[7] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, H. B. McMahan et al., "Towards federated learning at scale: system design," arXiv preprint arXiv:1902.01046, 2019.
[8] A. Gupta and R. K. Jha, "A survey of 5G network: architecture and emerging technologies," IEEE Access, vol. 3, pp. 1206–1232, 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE CVPR, 2016, pp. 770–778.
[10] M. M. Amiri and D. Gündüz, "Machine learning at the wireless edge: distributed stochastic gradient descent over-the-air," IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, 2020.
[11] ——, "Federated learning over wireless fading channels," IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546–3557, 2020.
[12] G. Zhu and K. Huang, "Broadband analog aggregation for low-latency federated edge learning," IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, Jan. 2020.
[13] T. Sery, N. Shlezinger, K. Cohen, and Y. Eldar, "Over-the-air federated learning from heterogeneous data," arXiv preprint arXiv:2009.12787, 2020.
[14] K. Yang, T. Jiang, Y. Shi, and Z. Ding, "Federated learning via over-the-air computation," IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, 2020.
[15] G. Zhu, Y. Du, D. Gunduz, and K. Huang, "One-bit over-the-air aggregation for communication-efficient federated edge learning: design and convergence analysis," IEEE Trans. Wireless Commun., pp. 1–1, 2020.
[16] M. M. Amiri, T. M. Duman, and D. Gunduz, "Collaborative machine learning at the wireless edge with blind transmitters," in Proc. IEEE Global Conf. on Signal and Info. Proc. (GlobalSIP), Dec. 2019.
[17] M. M. Amiri, T. M. Duman, D. Gunduz, S. R. Kulkarni, and H. V. Poor, "Blind federated edge learning," arXiv preprint arXiv:2010.10030, 2020.
[18] Y. Sun, S. Zhou, and D. Gündüz, "Energy-aware analog aggregation for federated learning with redundant data," in IEEE International Conference on Communications (ICC), 2020, pp. 1–7.
[19] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, "Toward an intelligent edge: wireless communication meets machine learning," IEEE Commun. Mag., vol. 58, no. 1, pp. 19–25, 2020.
[20] D. Tse and P. Viswanath, Fundamentals of Wireless Communication. Cambridge University Press, 2005.
[21] S. Verdu, Multiuser Detection. Cambridge University Press, 1998.
[22] A. Tveit, "On the complexity of matrix inversion," Mathematical Note, p. 1, 2003.
[23] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger, "Factor graphs and the sum-product algorithm," IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 498–519, 2001.
[24] H.-A. Loeliger, J. Dauwels, J. Hu, S. Korl, L. Ping, and F. R. Kschischang, "The factor graph approach to model-based signal processing," Proc. IEEE, vol. 95, no. 6, pp. 1295–1322, 2007.
[25] Y. Shao, S. C. Liew, and L. Lu, "Asynchronous physical-layer network coding: symbol misalignment estimation and its effect on decoding," IEEE Trans. Wireless Commun., vol. 16, no. 10, pp. 6881–6894, 2017.
[26] K. Murphy, Y. Weiss, and M. I. Jordan, "Loopy belief propagation for approximate inference: an empirical study," arXiv preprint arXiv:1301.6725, 2013.
[27] N. Noorshams and M. J. Wainwright, "Belief propagation for continuous state spaces: stochastic message-passing with quantitative guarantees," J. Mach. Learn. Research, vol. 14, no. 1, pp. 2799–2835, 2013.
[28] T. Wang, L. Shi, S. Zhang, and H. Wang, "Gaussian mixture message passing for blind known interference cancellation," IEEE Trans. Wireless Commun., vol. 18, no. 9, pp. 4268–4282, 2019.
[29] G. Forney, "Maximum-likelihood sequence estimation of digital sequences in the presence of intersymbol interference," IEEE Trans. Inf. Theory, vol. 18, no. 3, pp. 363–378, 1972.
[30] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Trans. Inf. Theory, vol. 20, no. 2, pp. 284–287, 1974.
[31] A. Krizhevsky, G. Hinton et al., "Learning multiple layers of features from tiny images," Technical Report, 2009.
[32] N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, "ShuffleNet V2: practical guidelines for efficient CNN architecture design," in ECCV, 2018, pp. 116–131.
[33] Y. Shao, D. Gündüz, and S. C. Liew, "Federated edge learning with misaligned over-the-air computation," source code, available online at: https://github.com/lynshao/MisAlignedOAC, 2020.
[34] J.-L. Gauvain and C.-H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Process., vol. 2, no. 2, pp. 291–298, 1994.
[35] P. Ahrendt, "The multivariate Gaussian probability distribution," Technical University of Denmark, Tech. Rep., 2005.
[36] C. B. Do, "More on multivariate Gaussians,"