Federated Learning over Wireless Networks: A Band-limited Coordinated Descent Approach
Junshan Zhang, Na Li, and Mehmet Dedeoglu
School of Electrical, Computer and Energy Engineering, Arizona State University; School of Engineering and Applied Sciences, Harvard University
[email protected], [email protected], [email protected]
Abstract—We consider a many-to-one wireless architecture for federated learning at the network edge, where multiple edge devices collaboratively train a model using local data. The unreliable nature of wireless connectivity, together with constraints on computing resources at edge devices, dictates that the local updates at edge devices should be carefully crafted and compressed to match the wireless communication resources available, and should work in concert with the receiver. Thus motivated, we propose SGD-based bandlimited coordinate descent algorithms for such settings. Specifically, for the wireless edge employing over-the-air computing, a common subset of k coordinates of the gradient updates across edge devices is selected by the receiver in each iteration, and then transmitted simultaneously over k sub-carriers, each experiencing time-varying channel conditions. We characterize the impact of communication error and compression, in terms of the resulting gradient bias and mean squared error, on the convergence of the proposed algorithms. We then study learning-driven communication error minimization via joint optimization of power allocation and learning rates. Our findings reveal that optimal power allocation across different sub-carriers should take into account both the gradient values and channel conditions, thus generalizing the widely used water-filling policy. We also develop sub-optimal distributed solutions amenable to implementation.
I. INTRODUCTION
In many edge networks, mobile and IoT devices that collect a huge amount of data are often connected to each other or to a central node wirelessly. The unreliable nature of wireless connectivity, together with the constraints on computing resources at edge devices, puts forth a significant challenge for the computation, communication and coordination required to learn an accurate model at the network edge. In this paper, we consider a many-to-one wireless architecture for distributed learning at the network edge, where the edge devices collaboratively train a machine learning model, using local data, in a distributed manner. This departs from conventional approaches that rely heavily on cloud computing to handle high-complexity processing tasks, where one significant challenge is to meet the stringent low-latency requirement. Further, due to privacy concerns, it is highly desirable to derive local learning model updates without sending data to the cloud. In such distributed learning scenarios, the communication between the edge devices and the server can become a bottleneck, in addition to the other challenges in achieving edge intelligence.

In this paper, we consider a wireless edge network with M devices and an edge server, where a high-dimensional machine learning model is trained using distributed learning. In such a setting with unreliable and rate-limited communications, local updates at sender devices should be carefully crafted and compressed to make full use of the available wireless communication resources, and should work in concert with the receiver (edge server) so as to learn an accurate model. Notably, lossy wireless communication for edge intelligence presents unique challenges and opportunities [1], subject to bandwidth and power requirements, on top of the employed multiple-access techniques. Since it often suffices to compute a function of the sum of the local updates for training the model, over-the-air computing is a favorable alternative to standard multiple-access communications for edge learning. More specifically, over-the-air computation [2], [3] takes advantage of the superposition property of the wireless multiple-access channel via simultaneous analog transmissions of the local messages, and then computes a function of the messages at the receiver, with the signal-to-noise ratio (SNR) scaling well with an increasing number of users. In a nutshell, when multiple edge devices collaboratively train a model, it is plausible to employ distributed learning over-the-air.

We seek to answer the following key questions: 1) What is the impact of the wireless communication bandwidth/power on the accuracy and convergence of edge learning? 2) Which coordinates of the local gradient signals should be communicated by each edge device to the receiver? 3) How should the coordination be carried out so that multiple sender devices can work in concert with the receiver? 4) What is the optimal way for the receiver to process the received noisy gradient signals for use in the stochastic gradient descent algorithm? 5) How should each sender device carry out power allocation across subcarriers to transmit its local updates? Intuitively, it is sensible to allocate more power to a coordinate with a larger gradient value so as to speed up convergence; further, the power allocation should also be channel-aware.

To answer the above questions, we consider an integrated learning and communication scheme where multiple edge devices send their local gradient updates over multi-carrier communications to the receiver for learning.
Let K denote the number of subcarriers available for communications, where K is determined by the wireless bandwidth. First, K dimensions of the gradient updates are determined (by the receiver) to be transmitted. Multiple methods can be used for selecting the K coordinates, e.g., selecting the top-k (in absolute value) coordinates of the sum of the gradients, or randomized uniform selection. This paper focuses on randomized uniform selection (we elaborate further on this in Section VI). During the subsequent communications, the gradient updates are transmitted only in the K selected dimensions via over-the-air computing over the K corresponding sub-carriers, each experiencing time-varying channel conditions and hence time-varying transmission errors. The devices are subject to power constraints, giving rise to a key question: how should transmission power be allocated across dimensions, at each edge device, based on the gradient update values and the channel conditions? Thus, we explore joint optimization of the power allocation and the learning rate to obtain the best estimate of the gradient updates and minimize the impact of the communication error. We investigate a centralized solution to this problem as a benchmark, and then devise sub-optimal distributed solutions amenable to practical implementation. We note that we have also studied the impact of synchronization errors across devices in this setting (we omit the details due to limited space).

Fig. 1. A bandlimited coordinate descent algorithm for distributed learning over the wireless multi-access channel.

The main contributions of this paper are summarized as follows:
• We take a holistic approach to study federated learning algorithms over wireless MAC channels. The proposed bandlimited coordinate descent (BLCD) algorithm is built on an innovative integration of over-the-air computing, multi-carrier communications, and wireless resource allocation.
• We characterize the impact of communication error and compression, in terms of the resulting gradient bias and mean squared error (MSE), on the convergence performance of the proposed algorithms. Specifically, when the communication error is unbiased, the BLCD algorithm converges to a stationary point under very mild conditions on the loss function. When the bias in the communication error does exist, the iterates of the BLCD algorithm return to a contraction region centered around a scaled version of the bias infinitely often.
• To minimize the impact of the communication error, we study joint optimization of power allocation at individual devices and learning rates at the receiver.
Observe that since there exist tradeoffs between bias and variance, minimizing the MSE of the communication error does not necessarily amount to minimizing the bias therein. Our findings reveal that the optimal power allocation across different sub-carriers should take into account both the gradient values and the channel conditions, thus generalizing the widely used water-filling policy. We also develop sub-optimal distributed solutions amenable to implementation. In particular, due to the power constraints at individual devices, it is not always feasible to achieve unbiased estimators of the gradient signal across the coordinates. To address this complication, we develop a distributed algorithm which can drive the bias in the communication error to (close to) zero under given power constraints and then reduce the corresponding variance as much as possible.
II. RELATED WORK
Communication-efficient SGD algorithms are of great interest to reduce the latency caused by the transmission of high-dimensional gradient updates with minimal performance loss. Such algorithms in the ML literature are based on compression via quantization [4]-[7], sparsification [8]-[10] and federated learning [11] (or local updates [12]), where lossless communication is assumed. At the wireless edge, physical-layer design and communication loss should be taken into consideration for the adoption of communication-efficient algorithms. Power allocation for over-the-air computation has been investigated for different scenarios in many other works [13]-[17], including MIMO, reduced-dimension MIMO, the standard many-to-one channel and different channel models. In related works on ML over wireless channels, [18]-[25] consider over-the-air transmissions for training of the ML model. The authors in [21] propose sparsification of the updates with compressive sensing for further bandwidth reduction, and the recovered sum of the compressed sparse gradients is used for the update; they apply a similar framework to federated learning over fading channels in [22]. [18] considers broadband aggregation for federated learning with opportunistic scheduling based on the channel coefficients for a set of devices uniformly distributed over a ring. Lastly, [25] optimizes gradient-descent-based learning over multiple-access fading channels. It is worth noting that the existing approaches for distributed learning in wireless networks do not fully account for the characteristics of lossy wireless channels. It is our hope that the proposed BLCD algorithms can lead to an innovative architecture for distributed edge learning over wireless networks that accounts for computation, power, spectrum constraints and packet losses.
III. FEDERATED LEARNING OVER WIRELESS MULTI-ACCESS NETWORKS
A. Distributed Edge Learning Model
Consider an edge computing environment with M devices $\mathcal{M} = \{1, \ldots, M\}$ and an edge server. As illustrated in Figure 1, a high-dimensional ML model is trained at the server using an SGD-based algorithm, where stochastic gradients are calculated at the devices with locally obtained data points, and a (common) subset of the gradient updates is transmitted over-the-air through different subcarriers. The general edge learning problem is

$\min_{w \in \mathbb{R}^d} f(w) := \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{\xi_m}[\,l(w, \xi_m)\,]$,   (1)

in which $l(\cdot)$ is the loss function and edge device m has access to inputs $\xi_m$. Such optimization is typically performed through empirical risk minimization iteratively. In the sequel, we let $w_t$ denote the parameter value of the ML model at communication round t; at round t, edge device m uses its local data $\xi_{m,t}$ to compute a stochastic gradient $g_t^m(w_t) := \nabla l(w_t, \xi_{m,t})$. Define $g_t(w_t) = \frac{1}{M}\sum_{m=1}^{M} g_t^m(w_t)$. The standard vanilla SGD algorithm is given as

$w_{t+1} = w_t - \gamma g_t(w_t)$,   (2)

with $\gamma$ being the learning rate. Nevertheless, different updates can be employed for different SGD algorithms, and this study focuses on communication-error-aware SGD algorithms.
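To make the setup concrete, the following minimal NumPy sketch runs the error-free update (2) on a toy least-squares instance of (1). It is illustrative only: the data, the dimensions, and the helper local_gradient are stand-ins rather than anything specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
M, d, gamma = 8, 50, 0.05        # devices, model dimension, learning rate

# Toy local data: device m holds (A_m, y_m), and l(w, xi) is a squared loss.
A = [rng.normal(size=(32, d)) for _ in range(M)]
w_true = rng.normal(size=d)
y = [a @ w_true + 0.1 * rng.normal(size=32) for a in A]

def local_gradient(m, w, batch=8):
    """Stochastic gradient g_t^m(w) computed on a random local batch xi_{m,t}."""
    idx = rng.choice(A[m].shape[0], size=batch, replace=False)
    Am, ym = A[m][idx], y[m][idx]
    return Am.T @ (Am @ w - ym) / batch

w = np.zeros(d)
for t in range(300):
    g = np.mean([local_gradient(m, w) for m in range(M)], axis=0)  # g_t(w_t)
    w = w - gamma * g                                              # update (2)
```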
B. Bandlimited Coordinate Descent Algorithm

Due to the significant discrepancy between the wireless bandwidth constraint and the high-dimensional nature of the gradient signals, we propose a sparse variant of the SGD algorithm over the wireless multiple-access channel, named bandlimited coordinate descent (BLCD), in which at each iteration only a common set of K coordinates, $I(t) \subset \{1, \ldots, d\}$ (with $K \ll d$), of the gradients is selected to be transmitted through over-the-air computing for the gradient updates. The details of coordinate selection for the BLCD algorithm are relegated to Section VI. Worth noting is that, due to the unreliable nature of wireless connectivity, the communication is assumed to be lossy, resulting in erroneous estimation of the updates at the receiver. Moreover, gradient correction is performed by keeping the difference between the update made at the receiver and the gradient value at the transmitter for the subsequent rounds, as gradient correction dramatically improves the convergence rate with sparse gradient updates [9].

Algorithm 1 Bandlimited Coordinate Descent Algorithm
Input: sample batches $\xi_{m,t}$, model parameters $w_0$, initial learning rate $\gamma$, sparsification operator $C_t(\cdot)$, for all $m = 1, \ldots, M$ and $t = 1, \ldots, T$. Initialize: $r_0^m := 0$.
for t = 1 : T do
  for m = 1 : M do
    $g_t^m(w_t) :=$ stochasticGradient$(f(w_t, \xi_{m,t}))$
    $u_t^m := \gamma g_t^m(w_t) + r_t^m$
    $r_{t+1}^m := u_t^m - C_t(u_t^m)$
    Compute power allocation coefficients $b^*_{km}$, $\forall k = 1, \ldots, K$
    Transmit $b^* \odot C_t(u_t^m)$
  end for
  Compute gradient estimator $\hat{G}_t(w_t)$
  $w_{t+1} := w_t - \hat{G}_t(w_t)$
  Broadcast $w_{t+1}$ back to all transmitters
end for

For convenience, we first define the gradient sparsification operator as follows.

Definition 1. Define $C_I : \mathbb{R}^d \to \mathbb{R}^d$ for a set $I \subseteq \{1, \ldots, d\}$ as follows: for every input $x \in \mathbb{R}^d$, $(C_I(x))_j$ is $(x)_j$ for $j \in I$ and $0$ otherwise.

Since this operator $C_I$ compresses a d-dimensional vector to a K-dimensional one, we also refer to it as a compression operator in the rest of the paper. With a slight abuse of notation, we let $C_t$ denote $C_{I(t)}$ in the following. Following [26], we incorporate the sparsification error made in each iteration (by the compression operator $C_t$) into the next step, to alleviate the possible gradient bias therein and improve the convergence. Specifically, as in [26], one plausible way for compression error correction is to update the gradient correction term as follows:

$r_{t+1}^m = u_t^m - C_t(u_t^m)$,   (3)
$u_t^m \triangleq \gamma g_t^m(w_t) + r_t^m$,   (4)

where $r_{t+1}^m$ keeps the error of the sparsification operator in the memory of user m at round t, and $u_t^m$ is the scaled gradient with correction at device m, where the scaling factor $\gamma$ is the learning rate in (2). (We refer readers to [26] for more insights on this error-feedback based compressed SGD.) Due to the lossy nature of wireless communications, there are communication errors and the gradient estimators at the receiver are erroneous. In particular, the gradient estimator at the receiver in BLCD can be written as

$\hat{G}_t(w_t) = \frac{1}{M} \sum_{m=1}^{M} C_t(u_t^m) + \epsilon_t$,   (5)

where $\epsilon_t$ denotes the random communication error in round t. In a nutshell, the bandlimited coordinate descent algorithm is outlined in Algorithm 1.

Recall that $g_t(w_t) = \frac{1}{M}\sum_{m=1}^{M} g_t^m(w_t)$ and define $r_t \triangleq \frac{1}{M}\sum_{m=1}^{M} r_t^m$. Thanks to the common sparsification operator across devices, the update of the SGD algorithm at communication round t is given by

$w_{t+1} = w_t - [\,C_t(\gamma g_t(w_t) + r_t) + \epsilon_t\,]$.   (6)

To quantify the impact of the communication error, we use the corresponding communication-error-free counterpart as the benchmark, defined as

$\hat{w}_{t+1} = w_t - C_t(\gamma g_t(w_t) + r_t)$.   (7)

It is clear that $w_{t+1} = \hat{w}_{t+1} - \epsilon_t$. For convenience, we define $\tilde{w}_t \triangleq w_t - r_t$. It can be shown that $\tilde{w}_{t+1} = \tilde{w}_t - \gamma g_t(w_t) - \epsilon_t$. Intuitively, $w_{t+1}$ in (6) is a noisy version of the iterate $\hat{w}_{t+1}$ in (7), which implies that $\tilde{w}_{t+1}$ is a noisy version of the compression-error correction of $\hat{w}_{t+1}$ in (7), where the "noisy perturbation" is incurred by the communication error.
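The device-side error-feedback step of Algorithm 1, i.e., equations (3)-(4) together with the sparsification operator of Definition 1, can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names; the channel is omitted, so the returned message is the signal that would then be handed to the over-the-air transmission stage.

```python
import numpy as np

def sparsify(x, coords):
    """Sparsification operator C_I of Definition 1: keep entries in I, zero the rest."""
    out = np.zeros_like(x)
    out[coords] = x[coords]
    return out

def blcd_device_step(g_m, r_m, coords, gamma):
    """One device-side BLCD step with error feedback, eqs. (3)-(4).
    Returns the sparse message C_t(u_t^m) and the updated memory r_{t+1}^m."""
    u_m = gamma * g_m + r_m       # u_t^m = gamma * g_t^m(w_t) + r_t^m   (4)
    msg = sparsify(u_m, coords)   # the only part that is transmitted
    r_next = u_m - msg            # r_{t+1}^m = u_t^m - C_t(u_t^m)       (3)
    return msg, r_next
```

In the error-free case, the server would simply average the returned messages over m to form the estimator (5) with $\epsilon_t = 0$.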
C. BLCD Coordinate Transmissions over the Multi-Access Channel

Fig. 2. A multi-access communication protocol for bandlimited coordinate selection and transmission.
A key step in the BLCD algorithm is to achieve coordinate synchronization of the transmissions among many edge devices. To this end, we introduce a receiver-driven, low-complexity multi-access communication protocol, as illustrated in Fig. 2, with the function $C_t(x)$ denoting the compression of x at round t. Let $I(t)$ (of size K) denote the subset of coordinates chosen for transmission by the receiver at round t. Observe that the updates at the receiver are carried out only in the dimensions $I(t)$. Further, the edge receiver can broadcast its updated iterate to the participating devices over the reverse link; this task is quite simple, given the broadcast nature of wireless channels. In the transmissions, each coordinate of the gradient updates is mapped to a specific subcarrier and then transmitted through the wireless MAC channel, and the coordinates transmitted by different devices over the same subcarrier are received by the edge server in the form of an aggregate sum. It is worth noting that the above protocol is also applicable to the case when the SGD updates are carried out for multiple rounds at the devices.

When there are many edge devices, over-the-air computation can be used to take advantage of the superposition property of the wireless multiple-access channel via simultaneous analog transmissions of the local updates. More specifically, at round t, the received signal on subcarrier k is given by

$y_k(t) = \sum_{m=1}^{M} b_{km}(t)\, h_{km}(t)\, x_{km}(t) + n_k(t)$,   (8)

where $b_{km}(t)$ is a power scaling factor, $h_{km}(t)$ is the channel gain, $x_{km}(t)$ is the message of user m on subcarrier k, and $n_k(t) \sim \mathcal{N}(0, \sigma^2)$ is the channel noise. To simplify notation, we omit (t) in the following when it is clear from the context. Specifically, the message is $x_{km} = (C_t(u_t^m))_{l(k)}$, with the one-to-one mapping $l(k) = (I(t))_k$ indicating the k-th element of $I(t)$, transmitted through the k-th subcarrier. The total power that a device can use in the transmission is limited in practical systems. Without loss of generality, we assume a power constraint at each device, given by $\sum_{k=1}^{K} b_{km}^2 x_{km}^2 \le E_m$, $\forall m \in \{1, \ldots, M\}$. Note that $b_{km}$ hinges heavily upon both $h_m = [h_{1m}, \ldots, h_{Km}]^\top$ and $x_m = [x_{1m}, \ldots, x_{Km}]^\top$, and a key next step is to optimize $b_{km}(h_m, x_m)$. In each round, each device optimizes its power allocation for transmitting the selected coordinates of its update signal over the K subcarriers, aiming to minimize the communication error so as to achieve a good estimate of $G_t(w_t)$ (or its scaled version) for the gradient update, where $G_t(w_t) \triangleq \frac{1}{M} \sum_{m=1}^{M} C_t(u_t^m)$.

From the learning perspective, based on $\{y_k\}_{k=1}^K$, it is of paramount importance for the receiver to obtain a good estimate of $G_t(w_t)$. Since $n_k(t)$ is Gaussian noise, the optimal estimator is of the form

$(\hat{G}_t(w_t))_{l(k)} = \alpha_k y_k$ for $k = 1, \ldots, K$, and $(\hat{G}_t(w_t))_j = 0$ for $j \notin I(t)$,   (9)

where $\{\alpha_k\}_{k=1}^K$ are the gradient estimator coefficients for the subcarriers. It follows that the communication error (i.e., the gradient estimation error incurred by lossy communications) is given by

$\epsilon_t = \hat{G}_t(w_t) - G_t(w_t)$.   (10)

We note that $\{\alpha_k\}_{k=1}^K$ are intimately related to the learning rates for the K coordinates, scaling the learning rates to be $\{\gamma \alpha_k\}_{k=1}^K$.
It is interesting to observe that the learning rates in the proposed BLCD algorithm are essentially different across the dimensions, due to the unreliable and dynamically changing channel conditions across different subcarriers.
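A minimal simulation of the analog aggregation (8) and the linear estimate (9)-(10) might look as follows. Here, unit power scaling and $\alpha_k = 1/M$ are used purely for illustration; the optimized choices of b and $\alpha$ are the subject of Section V.

```python
import numpy as np

rng = np.random.default_rng(1)
M, K, sigma = 8, 16, 1.0

x = rng.normal(size=(K, M))                 # x[k, m]: coordinate l(k) of C_t(u_t^m)
h = rng.rayleigh(scale=1.0, size=(K, M))    # channel gains h_km
b = np.ones((K, M))                         # power scaling factors b_km (placeholder)
alpha = np.full(K, 1.0 / M)                 # receiver coefficients alpha_k (placeholder)

# Superposition on each subcarrier, eq. (8): y_k = sum_m b*h*x + n_k.
y = np.sum(b * h * x, axis=1) + rng.normal(scale=sigma, size=K)

G_hat = alpha * y          # receiver estimate on the selected coordinates, eq. (9)
G = x.mean(axis=1)         # target G_t(w_t) restricted to I(t)
eps = G_hat - G            # communication error, eq. (10)
```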
IV. IMPACT OF COMMUNICATION ERROR AND COMPRESSION ON THE BLCD ALGORITHM
Recall that, due to the common sparsification operator across devices, the update of the SGD algorithm at communication round t is given by $w_{t+1} = w_t - [C_t(\gamma g_t(w_t) + r_t) + \epsilon_t]$. Needless to say, the compression operator $C_t$ plays a critical role in sparse transmissions. In this study, we impose the following standard assumption on the compression rate of the operator.

Assumption 1.
For the set of random compression operators $\{C_t\}_{t=1}^T$ and any $x \in \mathbb{R}^d$, it holds that

$\mathbb{E}\|x - C_t(x)\|^2 \le (1 - \delta)\,\|x\|^2$   (11)

for some $\delta \in (0, 1]$.

We impose the following standard assumptions on the non-convex objective function $f(\cdot)$ and the corresponding stochastic gradients $g_t^m(w_t)$ computed with the data samples of device m in round t. (We assume that the data samples $\{\xi_{m,t}\}$ are i.i.d. across the devices and time.)

Assumption 2 (Smoothness). A function $f: \mathbb{R}^d \to \mathbb{R}$ is L-smooth if for all $x, y \in \mathbb{R}^d$, it holds that

$|f(y) - f(x) - \langle \nabla f(x), y - x \rangle| \le \frac{L}{2}\,\|y - x\|^2$.   (12)

Assumption 3.
For any $x \in \mathbb{R}^d$ and any $m = 1, \ldots, M$, the stochastic gradient $g_t^m(x)$, $\forall t$, satisfies

$\mathbb{E}[g_t^m(x)] = \nabla f(x)$,  $\mathbb{E}\|g_t^m(x)\|^2 \le G^2$,   (13)

where $G > 0$ is a constant.

It follows directly from [26] that $\mathbb{E}[\|r_t\|^2] \le \frac{4(1-\delta)}{\delta^2}\,\gamma^2 G^2$. Recall that $\tilde{w}_{t+1} = \tilde{w}_t - \gamma g_t(w_t) - \epsilon_t$ and that $\tilde{w}_{t+1}$ can be viewed as a noisy version of the compression-error correction of $\hat{w}_{t+1}$ in (7), where the "noisy perturbation" is incurred by the communication error. For convenience, let $\mathbb{E}_t[\epsilon_t]$ denote the gradient bias incurred by the communication error and $\mathbb{E}_t[\|\epsilon_t\|^2]$ be the corresponding mean squared error, where $\mathbb{E}_t$ is taken with respect to the channel noise. Let $\eta = \frac{2(L-\gamma)}{\gamma(2 - \rho\gamma)}$ with $0 < \rho < 2/\gamma$. Let $f^*$ denote the global minimum value of f. We have the following main result on the iterates of the BLCD algorithm.

Theorem 1.
Under Assumptions 1, 2 and 3, the iterates $\{w_t\}$ of the BLCD algorithm satisfy

$\frac{1}{T+1} \sum_{t=0}^{T} \Big( \|\nabla f(w_t)\|^2 - \eta\, \underbrace{\|\mathbb{E}_t[\epsilon_t]\|}_{\text{bias}}\, \|\nabla f(w_t)\| \Big) \le \frac{1}{T+1} \sum_{t=0}^{T} \Big[ \frac{L}{\gamma(2-\rho\gamma)}\, \underbrace{\mathbb{E}_t[\|\epsilon_t\|^2]}_{\text{MSE}} + \frac{\rho}{\gamma(2-\rho\gamma)}\, \underbrace{\|\mathbb{E}_t[\epsilon_t]\|^2}_{\text{bias}^2} \Big] + \frac{2}{T+1} \cdot \frac{f(w_0) - f^*}{\gamma(2 - \rho\gamma)} + \Big( \frac{2L(1-\delta)}{\rho\delta^2} + \frac{1}{2} \Big) \frac{2L\gamma G^2}{2 - \rho\gamma}$.   (15)

Proof.
Due to the limited space, we outline only a few main steps of the proof. Recall that $\tilde{w}_t = w_t - r_t$. It can be shown that $\tilde{w}_{t+1} = \tilde{w}_t - \gamma g_t(w_t) - \epsilon_t$. Using the properties of the iterates of the BLCD algorithm and the smoothness of the objective function f, we can establish an upper bound on $\mathbb{E}_t[f(\tilde{w}_{t+1})]$ in terms of $f(\tilde{w}_t)$, the corresponding gradient $\nabla f(w_t)$, and the gradient bias and MSE due to the communication error (see (42)-(48) in the Appendix). Then, (15) can be obtained after some further algebraic manipulation.

Remarks.
Based on Theorem 1, we have a few observations in order.
• We first examine the four terms on the right-hand side of (15). The first two terms capture the impact on the gradient of the time average of the bias in the communication error $\epsilon_t$ and of the corresponding mean square error, denoted as MSE; these two terms go to zero if the bias and the MSE diminish. The third term is a scaled version of $f(w_0) - f^*$ and goes to zero as long as $\gamma = O(T^{-\beta})$ with $\beta < 1$; and the fourth term is proportional to $\gamma$ and goes to zero when $\gamma \to 0$.
• If the right-hand side of (15) diminishes as $T \to \infty$, the iterates of the BLCD algorithm "converge" to a neighborhood around $\eta \|\mathbb{E}_t[\epsilon_t]\|$, which is a scaled version of the bias in the communication error. For convenience, let $\bar{\epsilon} = \limsup_t \|\mathbb{E}_t[\epsilon_t]\|$, and define a contraction region as follows:

$A_\gamma = \{ w_t : \|\nabla f(w_t)\| \le (\eta + \Delta)\, \bar{\epsilon} \}$,

where $\Delta > 0$ is an arbitrarily small positive number. It then follows that the iterates of the BLCD algorithm "converge" to the contraction region $A_\gamma$, in the sense that the iterates return to $A_\gamma$ infinitely often. Note that f is assumed to be any nonconvex smooth function; there can be many contraction regions, each corresponding to a stationary point.
• When the communication error is unbiased, the gradients diminish to 0 and hence the BLCD algorithm converges to a stationary point. When the bias in the communication error does exist, there is an intrinsic tradeoff between the size of the contraction region and $\eta \|\mathbb{E}_t[\epsilon_t]\|$: when the learning rate $\gamma$ is small, the right-hand side of (15) is small, but $\eta$ can be large, and vice versa. It makes sense to choose a fixed learning rate that makes $\eta$ small; in this way, the gradients in the BLCD algorithm "concentrate" around a (small) scaled version of the bias.
• Finally, the impact of gradient sparsification is captured by $\delta$. For instance, when (randomly) uniform selection is used, $\delta = K/d$. We will elaborate on this in Section VI; a quick numerical sanity check is sketched below.
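The value $\delta = K/d$ for uniform selection can be checked numerically. The following Monte Carlo sketch (illustrative only) estimates $\mathbb{E}\|x - C_t(x)\|^2 / \|x\|^2$ and compares it against $1 - K/d$, for which the bound (11) holds with equality.

```python
import numpy as np

rng = np.random.default_rng(2)
d, K, trials = 100, 10, 20000

x = rng.normal(size=d)
err = 0.0
for _ in range(trials):
    coords = rng.choice(d, size=K, replace=False)   # uniform coordinate set I(t)
    cx = np.zeros(d)
    cx[coords] = x[coords]                          # C_t(x)
    err += np.sum((x - cx) ** 2)
err /= trials

# For uniform selection, E||x - C_t(x)||^2 = (1 - K/d)||x||^2, i.e., delta = K/d.
print(err / np.sum(x ** 2), 1.0 - K / d)            # the two values nearly match
```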
Further, we have the following corollary.

Corollary 1. Under Assumptions 1, 2, and 3, if $\mathbb{E}_t[\epsilon_t] = 0$ and $\gamma = \frac{1}{\sqrt{T+1}}$, the BLCD algorithm converges to a stationary point and satisfies

$\frac{1}{T+1} \sum_{t=0}^{T} \|\nabla f(w_t)\|^2 \le \frac{2}{2-\rho\gamma} \Big\{ \frac{f(w_0) - f^*}{\sqrt{T+1}} + \frac{L G^2}{\sqrt{T+1}} \Big( \frac{2L(1-\delta)}{\rho\delta^2} + \frac{1}{2} \Big) + \frac{L}{T+1} \sum_{t=0}^{T} \underbrace{\mathbb{E}_t[\|\epsilon_t\|^2]}_{\text{MSE}} \Big\}$.   (16)
V. COMMUNICATION ERROR MINIMIZATION VIA JOINT OPTIMIZATION OF POWER ALLOCATION AND LEARNING RATES
Theorem 1 reveals that the communication error has a significant impact on the convergence behavior of the BLCD algorithm. In this section, we turn our attention to minimizing the communication error (in terms of MSE and bias) via joint optimization of power allocation and learning rates. Without loss of generality, we focus on iteration t (with abuse of notation, we omit t for simplicity). Recall that the coordinate updates in the BLCD algorithm, sent by different devices over the same subcarrier, are received by the edge server as an aggregate sum, which is used to estimate the gradient value in that specific dimension. We denote the power coefficients and estimator coefficients as $b \triangleq [b_{11}, b_{12}, \ldots, b_{1M}, b_{21}, \ldots, b_{KM}]$ and $\alpha \triangleq [\alpha_1, \ldots, \alpha_K]$. In each round, each sender device optimizes its power allocation for transmitting the selected coordinates of its updates over the K subcarriers, aiming to achieve the best convergence rate. We assume that perfect channel state information is available at the corresponding transmitter, i.e., $h_m = [h_{1m}, \ldots, h_{Km}]^\top$ is available at sender m only. Based on (10), the mean squared error of the communication error in iteration t is given by

$\mathbb{E}_t[\|\epsilon_t\|^2] = \mathbb{E}\big[ \|\hat{G}_t(w_t) - G_t(w_t)\|^2 \big]$,   (17)

where the expectation is taken over the channel noise. For convenience, we denote $\mathbb{E}_t[\|\epsilon_t\|^2]$ as MSE; after some algebra, it can be rewritten as the sum of the variance and the square of the bias:

$\mathrm{MSE}(\alpha, b) = \sum_{k=1}^{K} \Big[ \underbrace{\sum_{m=1}^{M} \Big( \alpha_k b_{km} h_{km} - \frac{1}{M} \Big) x_{km}}_{\text{bias in } k\text{th coordinate}} \Big]^2 + \underbrace{\sum_{k=1}^{K} \sigma^2 \alpha_k^2}_{\text{variance}}$.   (18)

Recall that $\{\alpha_k\}_{k=1}^K$ are intimately related to the learning rates for the K coordinates, making the learning rate effectively $\{\gamma \alpha_k\}_{k=1}^K$.
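The decomposition (18) translates directly into code. The short sketch below assumes, purely as a storage convention, that b, h and x are K-by-M arrays and that alpha has length K.

```python
import numpy as np

def mse_decomposition(alpha, b, h, x, sigma):
    """MSE(alpha, b) of eq. (18): squared bias per coordinate plus variance.
    alpha: (K,) receiver coefficients; b, h, x: (K, M) arrays; sigma: noise std."""
    M = x.shape[1]
    bias_k = np.sum((alpha[:, None] * b * h - 1.0 / M) * x, axis=1)  # bias in k-th coord
    variance = np.sum(sigma ** 2 * alpha ** 2)                       # sum_k sigma^2 alpha_k^2
    return np.sum(bias_k ** 2) + variance
```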
A. Centralized Solutions to Minimizing MSE (Scheme 1)

In light of the above, we can cast the MSE minimization problem as a learning-driven joint power allocation and learning rate problem, given by
P1: $\min_{\alpha, b}\ \mathrm{MSE}(\alpha, b)$   (19)
s.t. $\sum_{k=1}^{K} b_{km}^2 x_{km}^2 \le E_m$, $\forall m$,   (20)
$b_{km} \ge 0$, $\alpha_k \ge 0$, $\forall k, m$,   (21)

which minimizes the MSE in every round. The formulated problem is non-convex because the objective function involves the product of variables. Nevertheless, it is biconvex: with one of the variables fixed, the problem is convex in the other. In general, we can solve this biconvex optimization problem in the same spirit as the EM algorithm, by taking the following two steps, each optimizing over a single variable, iteratively:

P1-a: $\min_{\alpha}\ \mathrm{MSE}(\alpha, b)$ s.t. $\alpha_k \ge 0$, $\forall k$;
P1-b: $\min_{b}\ \mathrm{MSE}(\alpha, b)$ s.t. $\sum_{k=1}^{K} b_{km}^2 x_{km}^2 \le E_m\ \forall m$, $b_{km} \ge 0\ \forall k, m$.
Since (P1-a) involves only the simple nonnegativity constraint, for given $\{b_{km}\}$ the optimal solution to (P1-a) is

$\alpha_k^* = \max\Big\{ \frac{ \big(\sum_{m=1}^{M} x_{km}\big) \big(\sum_{m=1}^{M} b_{km} h_{km} x_{km}\big) }{ M \big[ \sigma^2 + \big(\sum_{m=1}^{M} b_{km} h_{km} x_{km}\big)^2 \big] },\ 0 \Big\}$.   (22)

Then, we can solve (P1-b) by optimizing over b only.
Solving the sub-problems (P1-a) and (P1-b) iteratively leads to a local minimum, though not necessarily to the global solution. Observe that this solution requires global knowledge of the $x_{km}$'s and $h_{km}$'s of all devices, which is difficult to implement in practice; we therefore treat it as a benchmark only. Next, we turn our attention to developing distributed sub-optimal solutions.
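One plausible realization of the alternating procedure combines the closed-form alpha-step (22) with a projected-gradient b-step. The sketch below is illustrative only: the rescaling onto the power budget is a simple feasibility heuristic rather than an exact projection, and this is not claimed to be the solver used in the experiments.

```python
import numpy as np

def alternate_minimize(h, x, E, sigma=1.0, iters=100, lr=1e-2):
    """Biconvex descent for (P1): closed-form alpha-step (22), gradient b-step.
    h, x: (K, M) channel gains and messages; E: (M,) per-device power budgets."""
    K, M = x.shape
    b = np.ones((K, M))
    xbar = x.mean(axis=1)
    for _ in range(iters):
        beta = np.sum(b * h * x, axis=1)                              # beta_k = sum_m b*h*x
        alpha = np.maximum(xbar * beta / (sigma**2 + beta**2), 0.0)   # eq. (22)
        bias = np.sum((alpha[:, None] * b * h - 1.0 / M) * x, axis=1)
        grad_b = 2.0 * bias[:, None] * alpha[:, None] * h * x         # d MSE / d b_km
        b = np.maximum(b - lr * grad_b, 0.0)
        for m in range(M):                       # enforce the power budget (20)
            p = np.sum(b[:, m] ** 2 * x[:, m] ** 2)
            if p > E[m]:
                b[:, m] *= np.sqrt(E[m] / p)
    return alpha, b
```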
B. Distributed Solutions towards Zero Bias and Variance Reduction (Scheme 2)

As noted above, the centralized solution to (P1) requires global knowledge of the $x_{km}$'s and $h_{km}$'s and hence is not amenable to implementation. Further, minimizing the MSE of the communication error does not necessarily amount to minimizing the bias therein, since there are tradeoffs between bias and variance. Thus motivated, we next focus on devising distributed sub-optimal solutions which can drive the bias in the communication error to (close to) zero, and then reduce the corresponding variance as much as possible.

Specifically, observe from (18) that minimizing the MSE cost does not necessarily ensure that $\hat{G}$ is an unbiased estimator, due to the intrinsic tradeoff between bias and variance. To this end, we take a sub-optimal approach in which the optimization problem is decomposed into two subproblems. In the subproblem at the transmitters, each device m utilizes its available power and local gradient/channel information to compute a power allocation policy $\{b_{1m}, b_{2m}, \ldots, b_{Km}\}$. In the subproblem at the receiver, the receiver finds the best possible $\alpha_k$ for all $k = 1, \ldots, K$. Another complication is that, due to the power constraints at individual devices, it is not always feasible to achieve unbiased estimators of the gradient signal across the coordinates. Nevertheless, for given power constraints, one can achieve unbiased estimators of a scaled-down version of the coordinates of the gradient signal. In light of this, we formulate the optimization problem at each device (transmitter) m to ensure an unbiased estimator of a scaled version $\zeta_m$ of the transmitted coordinates, as follows:

Device m: $\max_{\{b_{km}\}_{k=1:K}}\ \zeta_m$   (23)
s.t. $\sum_{k=1}^{K} b_{km}^2 x_{km}^2 \le E_m$, $b_{km} \ge 0$,   (24)
$\zeta_m x_{km} - b_{km} h_{km} x_{km} = 0$, $\forall k = 1, \ldots, K$,   (25)

where maximizing $\zeta_m$ amounts to maximizing the corresponding SNR (and hence improving the gradient estimation accuracy). The first constraint is the power constraint, and the second constraint is imposed to ensure that the transmitted signals of user m carry the same bias-free scaling across the dimensions. The power allocation solution can be found using the Karush-Kuhn-Tucker (KKT) conditions as follows:

$\zeta_m^* = \sqrt{ \frac{E_m}{\sum_{k=1}^{K} x_{km}^2 / h_{km}^2} }$,  $b_{km}^* = \frac{\zeta_m^*}{h_{km}}$, $\forall k$.   (26)

Observe that, with the power allocation policy in (26), all K transmitted coordinates of device m have the same scaling factor $\zeta_m^*$. Next, we ensure zero bias by choosing the right $\alpha$ for gradient estimation at the receiver, which can be obtained by solving the following optimization problem, since all transmitted gradient signals are superimposed via the over-the-air transmission:

Receiver side: $\min_{\{\alpha_k\}}\ \sum_{k=1}^{K} \nu_k(\alpha_k, \{b_{km}^*\})$   (27)
s.t. $e_k(\alpha_k, \{b_{km}^*\}) = 0$, $\alpha_k \ge 0$, $\forall k = 1, \ldots, K$,   (28)

where $e_k$ and $\nu_k$ denote the bias and variance components, given by

$e_k(\alpha_k, \{b_{km}^*\}) = \alpha_k \Big( \sum_{m=1}^{M} \zeta_m^* x_{km} \Big) - \frac{1}{M} \sum_{m=1}^{M} x_{km}$,  $\nu_k(\alpha_k, \{b_{km}^*\}) = \alpha_k^2 \sigma^2$,   (29)

for all $k = 1, \ldots, K$. For given $\{\zeta_m^*\}$, it is easy to see that

$\alpha_k^* = \frac{ \frac{1}{M} \sum_{m=1}^{M} x_{km} }{ \sum_{m=1}^{M} \zeta_m^* x_{km} } \simeq \frac{1}{\sum_{m=1}^{M} \zeta_m^*}$, $\forall k$.   (30)

We note that, from an implementation point of view, since $\{x_{km}\}$ is not available at the receiver, it is sensible to set $\alpha_k^\dagger \simeq \frac{1}{\sum_{m=1}^{M} \zeta_m^*}$. Further, $\{\zeta_m^*\}$ is not readily available at the receiver either.
Nevertheless, since there is only one parameter $\zeta_m^*$ per sender m, the sum $\sum_{m=1}^{M} \zeta_m^*$ can be sent over a control channel to the receiver to compute $\alpha_k^\dagger$. It is worth noting that, in general, the bias exists even if $E_m$ is the same for all senders.

Next, we take a closer look at the case where the number of subchannels K is large (which is often the case in practice). Suppose that $\{x_{km}\}$ are i.i.d. across subchannels and users, and so are $\{h_{km}\}$. We can then simplify $\zeta_m^*$ further. For ease of exposition, we denote $\mathbb{E}[x_{km}^2] = \varphi^2 + \bar{x}^2$ and $\mathbb{E}[h_{km}^{-2}] = \varpi$. When K is large, for every user m we have

$\zeta_m^* = \frac{ \sqrt{E_m} }{ \sqrt{ \sum_{k=1}^{K} x_{km}^2 / h_{km}^2 } } \;\Longrightarrow\; \zeta_m^* \approx \frac{ \sqrt{E_m} }{ \sqrt{ K (\varphi^2 + \bar{x}^2)\, \varpi } }$ when K is large.   (31)

As a result, the bias and variance in each dimension k can be written as

$e_k(\alpha_k^*, \{b_{km}^*\}) = \sum_{m=1}^{M} \Big[ \frac{ \sqrt{E_m} }{ \sum_{m'=1}^{M} \sqrt{E_{m'}} } - \frac{1}{M} \Big] x_{km}$, $\forall k$,   (32)

$\nu_k = \frac{ K \varpi (\varphi^2 + \bar{x}^2) }{ \big( \sum_{m=1}^{M} \sqrt{E_m} \big)^2 }\, \sigma^2$, $\forall k$.   (33)

Observe that when $E_m$ is the same across the senders, the bias term $\mathbb{E}_t[\epsilon_t] = 0$ in the above setting, according to (32).
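Scheme 2 is simple to implement because the device side needs only (26), and the receiver side needs only the scalar sum of the reported $\zeta_m^*$, per (30). A minimal sketch with illustrative function names:

```python
import numpy as np

def device_power_allocation(x_m, h_m, E_m):
    """Device-side Scheme 2, eq. (26): common scaling zeta_m plus channel inversion.
    Uses the full power budget: sum_k (b_km x_km)^2 = E_m by construction."""
    zeta_m = np.sqrt(E_m / np.sum(x_m ** 2 / h_m ** 2))
    b_m = zeta_m / h_m
    return zeta_m, b_m

def receiver_coefficient(zeta_sum):
    """Receiver-side Scheme 2, eq. (30): alpha_k ~ 1 / sum_m zeta_m for every k."""
    return 1.0 / zeta_sum
```

Each device reports its single scalar $\zeta_m^*$ over a control channel, and the receiver sums them once per round, so the signaling overhead is O(M) scalars rather than O(KM).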
C. A User-Centric Approach Using the Single-User Solution (Scheme 3)

In this section, we consider a suboptimal user-centric approach, which provides insight on the power allocation across the subcarriers from a single-device perspective. We formulate the single-device (say user m) problem as

P2: $\min_{\{b_{km}\}, \{\alpha_k\}}\ \sum_{k=1}^{K} \big[ (\alpha_k b_{km} h_{km} - 1)\, x_{km} \big]^2 + \sigma^2 \sum_{k=1}^{K} \alpha_k^2$
s.t. $\sum_{k=1}^{K} b_{km}^2 x_{km}^2 \le E_m$; $b_{km} \ge 0$, $\alpha_k \ge 0$, $\forall k$.

Theorem 2.
The optimal solution $\{b_{km}^*, \alpha_k^*\}$ to (P2) is given by

$(b_{km}^*)^2 = \Big[ \sqrt{ \frac{\sigma^2}{\lambda_m x_{km}^2 h_{km}^2} } - \frac{\sigma^2}{h_{km}^2 x_{km}^2} \Big]^+$, $\forall k$,   (34)

$\alpha_k^* = \frac{ b_{km}^* h_{km} x_{km}^2 }{ \sigma^2 + (b_{km}^*)^2 h_{km}^2 x_{km}^2 }$, $\forall k$,   (35)

where $\lambda_m$ is a key parameter determining the water-filling level:

$\sum_{k=1}^{K} \Big[ \frac{1}{\sqrt{\lambda_m}} \cdot \frac{\sigma |x_{km}|}{h_{km}} - \frac{\sigma^2}{h_{km}^2} \Big]^+ = E_m$.   (36)

The proof of Theorem 2 is omitted due to space limitations. Observe that Theorem 2 reveals that the larger the gradient value (and the smaller the channel gain) on a subcarrier, the higher the power that should in general be allocated to it, and that $\{x_{km}/h_{km}\}$ can be used to compute the water level for applying the water-filling policy. Based on this result, in the multi-user setting each device can adopt the single-user power allocation solution given in Theorem 2. This solution can be applied individually, without requiring any coordination between devices.

Next, we take a closer look at the case where the number of subchannels K is large. Let $\bar{E}_m$ denote the average power constraint per subcarrier. When K is large, after some algebra, the optimization problem P2 can be further approximated as

P3: $\min_{b_{km}}\ \mathbb{E}\Big[ \frac{ x_{km}^2 \sigma^2 }{ b_{km}^2 h_{km}^2 x_{km}^2 + \sigma^2 } \Big]$
s.t. $\mathbb{E}[\, b_{km}^2 x_{km}^2 \,] \le \bar{E}_m$, $b_{km} \ge 0$,   (37)

where the expectation is taken with respect to $\{h_{km}\}$ and $\{x_{km}\}$. The solution for $k = 1, \ldots, K$ is

$b_{km}^* = \sqrt{ \Big[ \frac{\sigma}{ |x_{km}|\, h_{km} \sqrt{\lambda_m} } - \frac{\sigma^2}{ x_{km}^2 h_{km}^2 } \Big]^+ }$,   (38)

$\lambda_m < \frac{ h_{km}^2 x_{km}^2 }{ \sigma^2 } \;\Rightarrow\; b_{km}^* > 0$.   (39)

We can compute the bias and the variance accordingly.
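The water level in (36) can be computed by bisection on $\lambda' = 1/\sqrt{\lambda_m}$, since the allocated power is nondecreasing in $\lambda'$. The sketch below assumes all $x_{km}$ are nonzero and all $h_{km} > 0$; it returns the $b^*$ of (34).

```python
import numpy as np

def single_user_waterfilling(x, h, E, sigma=1.0, tol=1e-9):
    """Water-filling of Theorem 2 for one device: find lam_p = 1/sqrt(lambda_m)
    so that the power constraint (36) is met with equality, then b* from (34)."""
    a = sigma * np.abs(x) / h          # per-subcarrier slope  sigma*|x_km|/h_km
    c = sigma ** 2 / h ** 2            # per-subcarrier offset sigma^2/h_km^2

    def used_power(lam_p):             # sum_k [lam_p * a_k - c_k]^+ , cf. (36)
        return np.sum(np.maximum(lam_p * a - c, 0.0))

    lo, hi = 0.0, 1.0
    while used_power(hi) < E:          # grow the bracket until it covers E
        hi *= 2.0
    while hi - lo > tol:               # bisection for the water level
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if used_power(mid) < E else (lo, mid)
    lam_p = 0.5 * (lo + hi)
    b_sq = np.maximum(lam_p * a - c, 0.0) / x ** 2   # (b*_km)^2, eq. (34)
    return np.sqrt(b_sq)
```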
VI. COORDINATE SELECTION FOR BANDLIMITED COORDINATE DESCENT ALGORITHMS

The selection of the coordinates to operate on is crucial to the performance of sparsified SGD algorithms. It is not hard to see that selecting the top-k (in absolute value) coordinates of the sum of the gradients provides the best performance. However, in practice it may not always be feasible to obtain the top-k coordinates of the sum of the gradients, and in fact there are different solutions for selecting k dimensions with large absolute values; see, e.g., [22], [27]. Note that having each device individually transmit the top-k coordinates of its local gradient is not applicable to the over-the-air communication scenario considered here. Sequential device-to-device transmission provides an alternative approach [28], but such techniques are likely to require more bandwidth over wireless connections.

Another approach is the use of compression and/or sketching of the gradients to be transmitted. For instance, [22] examines a system that updates SGD by decompressing compressed gradients transmitted through over-the-air communication. To the best of our knowledge, such techniques do not come with rigorous convergence guarantees. A similar approach is taken in [27], where sketched gradients are transmitted through an error-free medium and then used to obtain the top-k coordinates; the devices next simply transmit the selected coordinates. Such an approach can be taken with over-the-air computing, since only the summation of the sketched gradients is necessary; however, it requires the transmission of $O(k \log d)$ dimensions, and to provide guarantees, $O(k \log d + k)$ up-link transmissions are needed. Alternatively, uniformly selected $O(k \log d + k)$ coordinates can be transmitted with similar bandwidth and energy requirements. For practical learning models with non-sparse updates, uniform coordinate selection tends to perform better. Moreover, the common K dimensions can be selected uniformly via synchronized pseudo-random number generators without any information transfer, as sketched below. To summarize, uniform selection of the coordinates is more attractive than methods aiming to recover the top-k coordinates, based on energy, bandwidth and implementation considerations; indeed, this is the approach we adopt.
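For instance, all parties can derive the common coordinate set from a shared seed and the round index, so no coordinate indices ever need to be signaled. The snippet below is a minimal sketch of this idea; the seeding convention is illustrative.

```python
import numpy as np

def common_coordinates(round_t, d, K, shared_seed=1234):
    """Return the K coordinates I(t) for round t. Every device and the receiver
    call this with the same shared_seed, so all parties agree on I(t) without
    exchanging any coordinate indices."""
    rng = np.random.default_rng([shared_seed, round_t])  # re-seeded each round
    return rng.choice(d, size=K, replace=False)
```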
VII. EXPERIMENTAL RESULTS

In this section, we evaluate the accuracy and convergence performance of the BLCD algorithm when using one of the following three schemes for power allocation and learning rate selection (aiming to minimize the impact of the communication error): 1) the biconvex-program-based solution (Scheme 1); 2) the distributed solution towards zero bias of Section V (Scheme 2); 3) the single-user solution (Scheme 3). We use the communication-error-free scheme as the baseline to evaluate the performance degradation. We also consider a naive scheme (Scheme 4) using equal power allocation over all dimensions, i.e., $b_{km} = \sqrt{ E_m / \sum_{k=1}^{K} x_{km}^2 }$.

In our first experiment, we consider a simple neural network trained on the MNIST dataset; the network consists of two 2-D convolutional layers followed by a single fully connected layer, and it has 7840 parameters. $K = 64$ dimensions are uniformly selected as the support of the sparse gradient transmissions. For convenience, we define $E_{\mathrm{avg}}$ as the average sum energy (of all devices) per dimension, normalized by the channel noise variance, i.e., $E_{\mathrm{avg}} = E M\, \mathbb{E}[h_{km}^2] / (K \sigma^2)$. Without loss of generality, we take the variance of the channel noise as $\sigma^2 = 1$, and the $\{h_{km}\}$ are independent and identically distributed Rayleigh random variables. Changing $E_{\mathrm{avg}}$ simply amounts to different SNR values. In Fig. 3, we take $K = 64$ and $M = 8$ devices, with a fixed batch size for calculating each gradient and a fixed learning rate $\gamma$. In the second experiment, we expand the model to a more sophisticated multi-layer neural network and an 18-layer ResNet [29], with 61706 and more than 11 million parameters, respectively. The multi-layer network consists of two 2-D convolutional layers followed by three fully connected layers. In all experiments, the local dataset of each worker is randomly selected from the entire MNIST dataset. We use multiple workers with varying batch sizes and utilize $K = 1024$ sub-channels for sparse gradient signal transmission.

It can be seen from Fig. 3 that, in the presence of communication error, the centralized solution (Scheme 1) based on biconvex programming converges quickly and performs the best; it can achieve accuracy close to the ideal error-free scenario.

Fig. 3. Testing accuracy over training iterations; the training model is a small neural network with 7840 differentiable parameters.

Fig. 4. Testing accuracy over training iterations; the training model is a multi-layer deep neural network with 61706 differentiable parameters.
Further, the distributed solution (Scheme 2) can eventually approach the performance of Scheme 1, but the single-user solution (Scheme 3) performs poorly, as does the naive scheme using equal power allocation (Scheme 4). There is a significant gap between their resulting accuracy and that of the error-free case, because the bias in Scheme 3 is more significant.

Next, Figures 4, 5 and 6 depict the results of the second experiment using much larger-scale deep neural networks. It can be observed from Figs. 4, 5 and 6 that the SNR can have a significant impact on the final accuracy. As expected, convergence on the ResNet network is slower in comparison to the other DNNs, due to the huge number of parameters and the small batch size. Nevertheless, it is clear that the learning accuracy improves significantly at high SNR. (The solution of the distributed algorithm for $E_{\mathrm{avg}} = 10$ is omitted in Fig. 6, since it is indistinguishably close to the error-free solution.) It is interesting to observe that, when the SNR increases, the distributed solution (Scheme 2) can achieve accuracy close to the ideal error-free case, but the single-user solution (Scheme 3) cannot. It is worth noting that, due to the computational complexity of biconvex programming in this large-scale case, Scheme 1 could not be solved effectively (hence it is not presented here). Further, the batch size at each worker can impact the convergence rate, but does not impact the final accuracy.
Fig. 5. Testing accuracy over training iterations; the training model is a multi-layer deep neural network with 61706 differentiable parameters.

Fig. 6. Testing accuracy over training iterations; the training model is an 18-layer ResNet with more than 11 million differentiable parameters.

VIII. CONCLUSIONS

In this paper, we consider a many-to-one wireless architecture for distributed learning at the network edge, where multiple edge devices collaboratively train a machine learning model, using local data, through a wireless channel. Observing the unreliable nature of wireless connectivity, we design an integrated communication and learning scheme, where the local updates at edge devices are carefully crafted and compressed to match the wireless communication resources available. Specifically, we propose SGD-based bandlimited coordinate descent algorithms employing over-the-air computing, in which a subset of k coordinates of the gradient updates across edge devices is selected by the receiver in each iteration and then transmitted simultaneously over k sub-carriers. We analyze the convergence of the proposed algorithms and characterize the effect of the communication error. Further, we study joint optimization of the power allocation and the learning rates therein to maximize the convergence rate. Our findings reveal that optimal power allocation across different sub-carriers should take into account both the gradient values and the channel conditions. We then develop sub-optimal solutions amenable to implementation and verify our findings through numerical experiments.

ACKNOWLEDGEMENTS
The authors thank Gautam Dasarathy for stimulating discussions in the early stage of this work. This work was supported in part by NSF Grants CNS-2003081, CNS-2003111, CPS-1739344 and ONR YIP N00014-19-1-2217.
REFERENCES
[1] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, "Towards an intelligent edge: Wireless communication meets machine learning," arXiv preprint arXiv:1809.00343, 2018.
[2] M. Goldenbaum and S. Stanczak, "Robust analog function computation via wireless multiple-access channels," IEEE Transactions on Communications, vol. 61, no. 9, pp. 3863–3877, 2013.
[3] O. Abari, H. Rahul, and D. Katabi, "Over-the-air function computation in sensor networks," arXiv preprint arXiv:1612.02307, 2016.
[4] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
[5] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, "TernGrad: Ternary gradients to reduce communication in distributed deep learning," in Advances in Neural Information Processing Systems, 2017, pp. 1509–1519.
[6] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, "signSGD: Compressed optimisation for non-convex problems," in International Conference on Machine Learning, 2018, pp. 559–568.
[7] J. Wu, W. Huang, J. Huang, and T. Zhang, "Error compensated quantized SGD and its applications to large-scale distributed optimization," in International Conference on Machine Learning, 2018, pp. 5321–5329.
[8] A. F. Aji and K. Heafield, "Sparse communication for distributed gradient descent," in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 440–445.
[9] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, "Sparsified SGD with memory," in Advances in Neural Information Processing Systems, 2018, pp. 4447–4458.
[10] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli, "The convergence of sparsified gradient methods," in Advances in Neural Information Processing Systems, 2018, pp. 5973–5983.
[11] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
[12] S. U. Stich, "Local SGD converges fast and communicates little," in International Conference on Learning Representations (ICLR), 2019.
[13] J. Dong, Y. Shi, and Z. Ding, "Blind over-the-air computation and data fusion via provable Wirtinger flow," arXiv preprint arXiv:1811.04644, 2018.
[14] W. Liu and X. Zang, "Over-the-air computation systems: Optimization, analysis and scaling laws," arXiv preprint arXiv:1909.00329, 2019.
[15] D. Wen, G. Zhu, and K. Huang, "Reduced-dimension design of MIMO over-the-air computing for data aggregation in clustered IoT networks," IEEE Transactions on Wireless Communications, vol. 18, no. 11, pp. 5255–5268, 2019.
[16] G. Zhu and K. Huang, "MIMO over-the-air computation for high-mobility multi-modal sensing," IEEE Internet of Things Journal, 2018.
[17] X. Cao, G. Zhu, J. Xu, and K. Huang, "Optimal power control for over-the-air computation in fading channels," arXiv preprint arXiv:1906.06858, 2019.
[18] G. Zhu, Y. Wang, and K. Huang, "Broadband analog aggregation for low-latency federated edge learning," IEEE Transactions on Wireless Communications, 2019.
[19] K. Yang, T. Jiang, Y. Shi, and Z. Ding, "Federated learning via over-the-air computation," arXiv preprint arXiv:1812.11750, 2018.
[20] Q. Zeng, Y. Du, K. K. Leung, and K. Huang, "Energy-efficient radio resource allocation for federated edge learning," arXiv preprint arXiv:1907.06040, 2019.
[21] M. M. Amiri and D. Gündüz, "Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air," arXiv preprint arXiv:1901.00844, 2019.
[22] ——, "Federated learning over wireless fading channels," arXiv preprint arXiv:1907.09769, 2019.
[23] ——, "Over-the-air machine learning at the wireless edge," in Proc. IEEE SPAWC, 2019, pp. 1–5.
[24] J.-H. Ahn, O. Simeone, and J. Kang, "Wireless federated distillation for distributed edge learning with heterogeneous data," in Proc. IEEE PIMRC, 2019, pp. 1–6.
[25] T. Sery and K. Cohen, "On analog gradient descent learning over multiple access fading channels," arXiv preprint arXiv:1908.07463, 2019.
[26] S. P. Karimireddy, Q. Rebjock, S. U. Stich, and M. Jaggi, "Error feedback fixes signSGD and other gradient compression schemes," arXiv preprint arXiv:1901.09847, 2019.
[27] N. Ivkin, D. Rothchild, E. Ullah, V. Braverman, I. Stoica, and R. Arora, "Communication-efficient distributed SGD with sketching," arXiv preprint arXiv:1903.04488, 2019.
[28] S. Shi, Q. Wang, K. Zhao, Z. Tang, Y. Wang, X. Huang, and X. Chu, "A distributed synchronous SGD algorithm with global top-k sparsification for low bandwidth networks," in Proc. IEEE ICDCS, 2019, pp. 2238–2247.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," arXiv preprint arXiv:1512.03385, 2015.
APPENDIX
A. Proof of Theorem 1
We restate equations (6) and (7) as follows:

$w_{t+1} = w_t - [\,C_t(\gamma g_t(w_t) + r_t) + \epsilon_t\,]$,   (40)
$\hat{w}_{t+1} = w_t - C_t(\gamma g_t(w_t) + r_t)$.   (41)

It is clear that $w_{t+1} = \hat{w}_{t+1} - \epsilon_t$. For convenience, we define $\tilde{w}_t = w_t - r_t = \hat{w}_t - r_t - \epsilon_{t-1}$. It can be shown that $\tilde{w}_{t+1} = \tilde{w}_t - \gamma g_t(w_t) - \epsilon_t$. By the L-smoothness of f (Assumption 2),

$\mathbb{E}_t[f(\tilde{w}_{t+1})] \le f(\tilde{w}_t) + \langle \nabla f(\tilde{w}_t), \mathbb{E}_t[\tilde{w}_{t+1} - \tilde{w}_t] \rangle + \frac{L}{2}\, \mathbb{E}_t[\|\tilde{w}_{t+1} - \tilde{w}_t\|^2]$   (42)

$= f(\tilde{w}_t) - \langle \nabla f(\tilde{w}_t), \gamma \mathbb{E}_t[g_t(w_t)] + \mathbb{E}_t[\epsilon_t] \rangle + \frac{L}{2}\, \mathbb{E}_t[\|\gamma g_t(w_t)\|^2] + \frac{L}{2}\, \mathbb{E}_t[\|\epsilon_t\|^2] + L\, \mathbb{E}_t[\langle \gamma g_t(w_t), \epsilon_t \rangle]$   (43)

$= f(\tilde{w}_t) - \langle \nabla f(w_t), \gamma \mathbb{E}_t[g_t(w_t)] + \mathbb{E}_t[\epsilon_t] \rangle - \langle \nabla f(\tilde{w}_t) - \nabla f(w_t), \gamma \mathbb{E}_t[g_t(w_t)] + \mathbb{E}_t[\epsilon_t] \rangle + \frac{L}{2}\, \mathbb{E}_t[\|\epsilon_t\|^2] + L\, \mathbb{E}_t[\langle \gamma g_t(w_t), \epsilon_t \rangle] + \frac{L}{2}\, \mathbb{E}_t[\|\gamma g_t(w_t)\|^2]$.   (44)

Bounding the cross terms via Young's inequality (with parameter $\rho$) and the Cauchy-Schwarz inequality, using $\|\nabla f(\tilde{w}_t) - \nabla f(w_t)\| \le L\|r_t\|$, and applying Assumption 3 yields

$\mathbb{E}_t[f(\tilde{w}_{t+1})] \le f(\tilde{w}_t) - \gamma \Big( 1 - \frac{\rho\gamma}{2} \Big) \|\nabla f(w_t)\|^2 + (L - \gamma)\, \|\nabla f(w_t)\|\, \|\mathbb{E}_t[\epsilon_t]\| + \frac{L^2}{2\rho}\, \mathbb{E}_t[\|r_t\|^2] + \frac{\rho}{2}\, \|\mathbb{E}_t[\epsilon_t]\|^2 + \frac{L}{2}\, \mathbb{E}_t[\|\epsilon_t\|^2] + \frac{L}{2}\, \gamma^2 G^2$.   (48)

Based on [26], we have that

$\mathbb{E}_t[\|r_t\|^2] \le \frac{4(1-\delta)}{\delta^2}\, \gamma^2 G^2$.   (49)

Summing (48) over t and telescoping, it follows that

$\frac{1}{T+1} \sum_{t=0}^{T} \Big\{ \gamma \Big( 1 - \frac{\rho\gamma}{2} \Big) \|\nabla f(w_t)\|^2 - (L - \gamma)\, \|\mathbb{E}_t[\epsilon_t]\|\, \|\nabla f(w_t)\| \Big\} \le \frac{f(w_0) - f^*}{T+1} + \frac{2L^2(1-\delta)}{\rho\delta^2}\, \gamma^2 G^2 + \frac{L}{2}\, \gamma^2 G^2 + \frac{1}{T+1} \sum_{t=0}^{T} \Big[ \frac{\rho}{2}\, \|\mathbb{E}_t[\epsilon_t]\|^2 + \frac{L}{2}\, \mathbb{E}_t[\|\epsilon_t\|^2] \Big]$.   (50)

Through some further algebraic manipulation (dividing by $\gamma(2-\rho\gamma)/2$), we have

$\frac{1}{T+1} \sum_{t=0}^{T} \Big( \|\nabla f(w_t)\|^2 - \frac{2(L-\gamma)}{\gamma(2-\rho\gamma)}\, \|\mathbb{E}_t[\epsilon_t]\|\, \|\nabla f(w_t)\| \Big) \le \frac{2}{T+1} \cdot \frac{f(w_0) - f^*}{\gamma(2-\rho\gamma)} + \Big( \frac{2L(1-\delta)}{\rho\delta^2} + \frac{1}{2} \Big) \frac{2L\gamma G^2}{2-\rho\gamma} + \frac{1}{T+1} \sum_{t=0}^{T} \Big[ \frac{L}{\gamma(2-\rho\gamma)}\, \mathbb{E}_t[\|\epsilon_t\|^2] + \frac{\rho}{\gamma(2-\rho\gamma)}\, \|\mathbb{E}_t[\epsilon_t]\|^2 \Big]$.   (51)

For convenience, let $\eta = \frac{2(L-\gamma)}{\gamma(2-\rho\gamma)}$, and define the contraction region

$C_\Delta = \{ w_t : \|\nabla f(w_t)\| \le (\eta + \Delta)\, \|\mathbb{E}_t[\epsilon_t]\| \}$,

where $\Delta > 0$. It follows from (51) that the iterates of the BLCD algorithm return to the contraction region infinitely often with probability one. Further, setting $\gamma = \frac{1}{\sqrt{T+1}}$, we have

$\frac{1}{T+1} \sum_{t=0}^{T} \Big( \|\nabla f(w_t)\|^2 - \eta\, \|\mathbb{E}_t[\epsilon_t]\|\, \|\nabla f(w_t)\| \Big) \le \frac{2}{2-\rho\gamma} \Big\{ \frac{f(w_0) - f^*}{\sqrt{T+1}} + \frac{L G^2}{\sqrt{T+1}} \Big( \frac{2L(1-\delta)}{\rho\delta^2} + \frac{1}{2} \Big) + \frac{L}{T+1} \sum_{t=0}^{T} \mathbb{E}_t[\|\epsilon_t\|^2] + \frac{\rho}{T+1} \sum_{t=0}^{T} \|\mathbb{E}_t[\epsilon_t]\|^2 \Big\}$.   (52)

B. Solution of Problem (P1-a)

Since the problem (P1-a) is convex, the Lagrangian function is given by

$L_a(\alpha, \lambda) = \mathrm{MSE}(b, \alpha) + \sum_{k=1}^{K} \lambda_k \alpha_k$.   (53)

Then the Karush-Kuhn-Tucker (KKT) conditions are as follows:

$\frac{\partial L_a(\alpha, \lambda)}{\partial \alpha_k} = 2 \Big( \sum_{m=1}^{M} b_{km} h_{km} x_{km} \Big) \Big( \sum_{m=1}^{M} \big( \alpha_k b_{km} h_{km} - \tfrac{1}{M} \big) x_{km} \Big) + 2\sigma^2 \alpha_k + \lambda_k = 0$,   (54)

$\lambda_k \ge 0$, $\lambda_k \alpha_k^* = 0$, $\alpha_k^* \ge 0$.   (55)

It follows that

$\alpha_k^* = \max\Big\{ \frac{ \big(\sum_{m=1}^{M} x_{km}\big) \big(\sum_{m=1}^{M} b_{km} h_{km} x_{km}\big) }{ M \big[ \sigma^2 + \big(\sum_{m=1}^{M} b_{km} h_{km} x_{km}\big)^2 \big] },\ 0 \Big\} \triangleq \max\Big\{ \frac{\bar{x}_k \beta_k}{\sigma^2 + \beta_k^2},\ 0 \Big\}$,   (56)

where the auxiliary variables are defined as $\beta_k = \sum_{m=1}^{M} b_{km} h_{km} x_{km}$ and $\bar{x}_k = \sum_{m=1}^{M} x_{km} / M$.

C. Proof of Theorem 2

Proof.
C. Proof of Theorem 2

Proof. Observing that the problem (P2) is defined only in terms of $b_k^2$, we define the auxiliary variables $\tilde{b}_k = b_k^2$, $\tilde{h}_k = h_k^2 / \sigma^2$ and $\tilde{x}_k = 1 / x_k^2$, and reformulate (P2) as
$$\text{P2-1:} \quad \min_{\tilde{\mathbf{b}}} \; \sum_{k=1}^{K} \big( \tilde{b}_k \tilde{h}_k + \tilde{x}_k \big)^{-1} \quad \text{s.t.} \quad \sum_{k=1}^{K} \frac{\tilde{b}_k}{\tilde{x}_k} \le E, \quad \tilde{b}_k \ge 0, \; \forall k,$$
which is convex and can be solved in closed form. Then, we have the Lagrangian
$$L(\tilde{\mathbf{b}}, \lambda, \boldsymbol{\mu}) = \sum_{k=1}^{K} \big( \tilde{b}_k \tilde{h}_k + \tilde{x}_k \big)^{-1} + \lambda \Bigg( \sum_{k=1}^{K} \frac{\tilde{b}_k}{\tilde{x}_k} - E \Bigg) - \sum_{k=1}^{K} \mu_k \tilde{b}_k, \tag{57}$$
which leads to the following KKT conditions:
$$\frac{\partial L(\tilde{\mathbf{b}}, \lambda, \boldsymbol{\mu})}{\partial \tilde{b}_k} = -\frac{\tilde{h}_k}{\big( \tilde{b}_k \tilde{h}_k + \tilde{x}_k \big)^2} + \frac{\lambda}{\tilde{x}_k} - \mu_k = 0,$$
$$\sum_{k=1}^{K} \frac{\tilde{b}_k^*}{\tilde{x}_k} \le E, \quad \tilde{b}_k^* \ge 0, \quad \lambda \ge 0, \quad \mu_k \ge 0, \quad \lambda \Bigg( E - \sum_{k=1}^{K} \frac{\tilde{b}_k^*}{\tilde{x}_k} \Bigg) = 0, \quad \mu_k \tilde{b}_k^* = 0.$$
For $\mu_k = 0$ and $\lambda > 0$, we have that
$$\tilde{b}_k^* = \max\Bigg\{ \sqrt{ \frac{\tilde{x}_k}{\lambda \tilde{h}_k} } - \frac{\tilde{x}_k}{\tilde{h}_k},\; 0 \Bigg\} = \Bigg[ \sqrt{ \frac{\tilde{x}_k}{\lambda \tilde{h}_k} } - \frac{\tilde{x}_k}{\tilde{h}_k} \Bigg]^+ \tag{58}$$
with $E = \sum_{k=1}^{K} \tilde{b}_k^* / \tilde{x}_k$. By combining (58) and $E = \sum_{k=1}^{K} \tilde{b}_k^* / \tilde{x}_k$, we obtain the following result:
$$E = \sum_{k=1}^{K} \Bigg[ \frac{\lambda'}{\sqrt{ \tilde{h}_k \tilde{x}_k }} - \frac{1}{\tilde{h}_k} \Bigg]^+ \tag{59}$$
for $\lambda' = \sqrt{1/\lambda}$. Equation (59) can be solved by using the water-filling algorithm, where the solution is found by increasing $\lambda'$ until the equality is satisfied. The optimal $\lambda'$ can then be plugged into $\tilde{b}_k^*$ to yield $b_k^* = \sqrt{\tilde{b}_k^*}$ as a solution to (P2).
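The monotone structure of (59) makes the water-filling step straightforward to implement; a minimal sketch, assuming the tilde-variables are precomputed and using bisection in place of a calibrated increment of $\lambda'$ (names are ours):

```python
import numpy as np

def waterfill_p2(h_t, x_t, E, tol=1e-9):
    """Water-filling for (59): find lambda' such that the consumed power
    sum_k [lambda'/sqrt(h_t x_t) - 1/h_t]^+ equals the budget E.

    h_t, x_t: length-K arrays of tilde-h_k and tilde-x_k. The consumed
    power is nondecreasing in lambda', so we bracket and bisect.
    """
    def power(lam):
        return np.sum(np.maximum(lam / np.sqrt(h_t * x_t) - 1.0 / h_t, 0.0))

    lo, hi = 0.0, 1.0
    while power(hi) < E:                 # grow the bracket until it overshoots
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if power(mid) < E else (lo, mid)
    lam = 0.5 * (lo + hi)
    b_tilde = np.maximum(lam * np.sqrt(x_t / h_t) - x_t / h_t, 0.0)
    return np.sqrt(b_tilde)              # b_k^* = sqrt(tilde-b_k^*)
```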
D. Distributed Solutions towards Zero Bias and Variance Reduction (Scheme 2)

The over-the-air gradient estimation requires a more comprehensive estimator design. To this end, a generalized optimization problem is defined for computing the optimal estimator. We define the MSE cost for the communication error $\epsilon_t$, in terms of the received signal $y = [y_1, y_2, \ldots, y_K]$, as
$$\mathbb{E}\big[(\hat{G} - G)^2\big] = \mathbb{E}\big[(\boldsymbol{\alpha} \odot y - G)^2\big] = \frac{1}{K}\sum_{k=1}^{K} \mathbb{E}\Bigg[ \Bigg( \alpha_k \Big( \sum_{m=1}^{M} b_{km} h_{km} x_{km} + n_k \Big) - G_k \Bigg)^2 \Bigg] \tag{60}$$
$$= \frac{1}{K}\sum_{k=1}^{K} \underbrace{\Bigg( \mathbb{E}\Bigg[ \alpha_k \Big( \sum_{m=1}^{M} b_{km} h_{km} x_{km} + n_k \Big) \Bigg] - \mathbb{E}\Bigg[ \frac{1}{M}\sum_{m=1}^{M} x_{km} \Bigg] \Bigg)^2}_{e_k^2} + \underbrace{\mathbb{E}\Bigg( \alpha_k \Big( \sum_{m=1}^{M} b_{km} h_{km} x_{km} + n_k \Big) - \mathbb{E}\Bigg[ \alpha_k \Big( \sum_{m=1}^{M} b_{km} h_{km} x_{km} + n_k \Big) \Bigg] \Bigg)^2}_{\nu_k}, \tag{61}$$
where the estimator is $\hat{G}_k = \alpha_k \big( \sum_{m=1}^{M} b_{km} h_{km} x_{km} + n_k \big)$ for $k = 1, 2, \ldots, K$. In (61), $\alpha_k$, $b_{km}$, $h_{km}$ and $n_k$ respectively denote the correction factor for recovering the $k$th dimension of the true gradient, the power allocation for the $k$th dimension of the local gradient $x_{km}$ at the $m$th transmitter, the channel fading coefficient of the $k$th sub-channel between the $m$th transmitter and the receiver, and the thermal additive noise on the $k$th sub-channel. Further, $\nu_k$ and $e_k$ denote the estimator variance and bias, respectively. As is apparent in (61), minimizing the MSE cost does not ensure that $\hat{G}_k$ is an unbiased estimator of $G_k$. To resolve this issue, we formulate the unbiased optimization problem as
$$\operatorname*{argmin}_{\{\alpha_k\}, \{b_{km}\}} \; \sum_{k=1}^{K} \nu_k\big(\alpha_k, \{b_{km}\}\big) \tag{62}$$
$$\text{subject to} \quad e_k\big(\alpha_k, \{b_{km}\}\big) = 0, \quad \sum_{k=1}^{K} b_{km}^2 x_{km}^2 \le E_m, \quad b_{km} \ge 0, \; \alpha_k \ge 0, \quad \forall k = 1, \ldots, K; \; \forall m = 1, \ldots, M, \tag{63}$$
where $E_m$ denotes the power budget of the $m$th transmitter. We note that $E_m$, $\{x_{1m}, x_{2m}, \ldots, x_{Km}\}$ and $\{b_{1m}, b_{2m}, \ldots, b_{Km}\}$ are available only at the $m$th transmitter, and $\alpha_k$, $\forall k$, only at the receiver. The optimization problem defined in (62)-(63) can then be decomposed into two stages. In the first stage, each transmitter $m$ utilizes all of its available power and its local gradient information to compute the optimal power allocation $\{b_{1m}, b_{2m}, \ldots, b_{Km}\}$. In the second stage, the receiver solves a consecutive optimization problem to find the optimal $\alpha_k$ for all $k = 1, \ldots, K$. The optimization problem at each transmitter is formulated as
$$\operatorname*{argmax}_{\{b_{km}\}_{k=1:K}} \; \zeta_m \tag{64}$$
$$\text{subject to} \quad \mathbb{E}\big[ (\zeta_m x_{km} - b_{km} h_{km} x_{km})^2 \big] = 0, \quad \sum_{k=1}^{K} b_{km}^2 x_{km}^2 \le E_m, \quad b_{km} \ge 0, \; \forall k = 1, \ldots, K. \tag{65}$$
The first constraint in (65) ensures that there is no additive bias in the transmitted signal (i.e., the bias can be removed by a multiplicative factor), while the second constraint is the power constraint. The first constraint can be restated in the simpler form $\zeta_m = b_{km} h_{km}$, and the solution is then obtained via the KKT conditions:

Lagrangian: $\mathcal{L}\big(\{b_{km}\}, \{\lambda_k\}, \vartheta, \{\beta_k\}\big) = \zeta_m - \sum_{k=1}^{K} \lambda_k \big( b_{km} h_{km} - \zeta_m \big) - \vartheta \Big( \sum_{k=1}^{K} b_{km}^2 x_{km}^2 - E_m \Big) + \sum_{k=1}^{K} \beta_k b_{km}$

Stationarity: $\frac{\partial \mathcal{L}}{\partial b_{km}} = -\lambda_k h_{km} - 2\vartheta x_{km}^2 b_{km} + \beta_k = 0 \;\Rightarrow\; b_{km} = \frac{\beta_k - \lambda_k h_{km}}{2\vartheta x_{km}^2}$

Primal feasibility: $b_{km} h_{km} = \zeta_m, \quad \sum_{k=1}^{K} b_{km}^2 x_{km}^2 \le E_m, \quad b_{km} \ge 0, \; \forall k = 1, \ldots, K,$

Dual feasibility: $\vartheta \ge 0, \quad \beta_k \ge 0, \; \forall k = 1, \ldots, K,$

Comp. slackness: $\vartheta \Big( \sum_{k=1}^{K} b_{km}^2 x_{km}^2 - E_m \Big) = 0, \quad \sum_{k=1}^{K} \beta_k b_{km} = 0.$

Then, the corresponding solution is given by
$$\zeta_m^* = \sqrt{ \frac{E_m}{\sum_{k=1}^{K} x_{km}^2 / h_{km}^2} }, \qquad b_{km}^* = \frac{\zeta_m^*}{h_{km}}, \; \forall k = 1, \ldots, K. \tag{66}$$
Equation (66) shows that the $m$th transmitter utilizes its entire power budget to amplify its transmitted local gradient signal. The received contribution of device $m$ is then $\zeta_m^*$ times its local gradient signal, inducing a multiplicative bias; this bias can be removed by multiplying the received signal by $\alpha$ at the receiver. However, the received signal at the receiver is a superposition of all transmitted signals because of the over-the-air transmission. Therefore, a single vector of optimal $\alpha$ must be computed to remove the bias and minimize the estimator variance. To this end, in the second stage the receiver solves the following optimization problem:
$$\operatorname*{argmin}_{\{\alpha_k\}} \; \sum_{k=1}^{K} \nu_k\big(\alpha_k, \{b_{km}^*\}\big) \tag{67}$$
$$\text{subject to} \quad e_k\big(\alpha_k, \{b_{km}^*\}\big) = 0, \quad \alpha_k \ge 0, \; \forall k = 1, \ldots, K. \tag{68}$$
Next, we derive $e_k(\alpha_k, \{b_{km}^*\})$ and $\nu_k(\alpha_k, \{b_{km}^*\})$; for ease of exposition, the cumbersome steps are omitted here:
$$e_k\big(\alpha_k, \{b_{km}^*\}\big) = \mathbb{E}\Bigg[ \alpha_k y_k - \frac{1}{M}\sum_{m=1}^{M} x_{km} \Bigg] = \mathbb{E}\Bigg[ \alpha_k \Big( \sum_{m=1}^{M} h_{km} b_{km}^* x_{km} + n_k \Big) - \frac{1}{M}\sum_{m=1}^{M} x_{km} \Bigg] = \alpha_k \Bigg( \sum_{m=1}^{M} \zeta_m^* x_{km} \Bigg) - \frac{1}{M}\sum_{m=1}^{M} x_{km}$$
$$\nu_k\big(\alpha_k, \{b_{km}^*\}\big) = \mathbb{E}\Bigg( \alpha_k \Big( \sum_{m=1}^{M} h_{km} b_{km}^* x_{km} + n_k \Big) - \mathbb{E}\Bigg[ \alpha_k \Big( \sum_{m=1}^{M} h_{km} b_{km}^* x_{km} + n_k \Big) \Bigg] \Bigg)^2 \tag{69}$$
$$= \mathbb{E}\big[ (\alpha_k n_k)^2 \big] = \alpha_k^2 \sigma^2, \tag{70}$$
where the expectation is taken with respect to $n_k$ for given realizations of all $x_{km}$, and $\sigma^2 = \mathbb{E}[n_k^2]$.
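Before turning to the receiver's correction factor, note that the first stage (66) is a simple local computation at each device; a minimal sketch (function and variable names are ours, not the paper's):

```python
import numpy as np

def transmitter_power_allocation(x_m, h_m, E_m):
    """Stage 1 of Scheme 2 at device m, following the closed form (66).

    x_m, h_m: length-K arrays of local gradient entries x_km and channel
    coefficients h_km; E_m: power budget. Returns (zeta_m^*, b_km^*).
    """
    zeta = np.sqrt(E_m / np.sum(x_m**2 / h_m**2))  # exhausts the budget E_m
    return zeta, zeta / h_m                        # b_km^* = zeta_m^* / h_km
```

Each device needs only its own gradient, its channel estimates, and its power budget, which is what makes the first stage fully distributed.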
By solving the KKT conditions, we obtain the following solution:
$$\alpha_k^* = \frac{ \frac{1}{M}\sum_{m=1}^{M} x_{km} }{ \sum_{m=1}^{M} h_{km} b_{km}^* x_{km} } = \frac{ \frac{1}{M}\sum_{m=1}^{M} x_{km} }{ \sum_{m=1}^{M} \zeta_m^* x_{km} }. \tag{71}$$
From the implementation point of view, it is sensible to set $\alpha_k^\dagger \simeq 1 / \sum_{m=1}^{M} \zeta_m^*$, since $\{x_{km}\}$ is not available at the receiver. We also note that $\zeta_m^*$, for all $m = 1, \ldots, M$, are not available at the receiver either. Fortunately, $\alpha_k^\dagger$ is a function of $\sum_{m=1}^{M} \zeta_m^*$ only, and hence a subchannel can be allocated for the over-the-air transmission of $\sum_{m=1}^{M} \zeta_m^*$. Subsequently, it follows that the bias is given by
$$e_k\big(\alpha_k^\dagger, \{b_{km}^*\}\big) = \mathbb{E}\Bigg[ \frac{y_k}{\sum_{m=1}^{M} \zeta_m^*} - \frac{1}{M}\sum_{m=1}^{M} x_{km} \Bigg] = \mathbb{E}\Bigg[ \frac{ \sum_{m=1}^{M} \zeta_m^* x_{km} + n_k }{ \sum_{m=1}^{M} \zeta_m^* } - \frac{1}{M}\sum_{m=1}^{M} x_{km} \Bigg]. \tag{72}$$
Assuming that the distributions of $\{x_{km}\}$, for all $k = 1, \ldots, K$ and $m = 1, \ldots, M$, are identical across the subchannels and users, and likewise for $\{h_{km}\}$, $\zeta_m^*$ can be simplified. For ease of exposition, we denote $\mathbb{E}[x_{km}^2] = \varphi + \bar{x}^2$ and $\mathbb{E}[h_{km}^{-2}] = \varpi$. When the number of subchannels $K$ is large, we have that
$$\zeta_m^* = \frac{\sqrt{E_m}}{\sqrt{\sum_{k=1}^{K} x_{km}^2 / h_{km}^2}} \;\Longrightarrow\; (K \text{ large}) \quad \zeta_m^* \approx \frac{\sqrt{E_m}}{\sqrt{K\, \mathbb{E}[x_{km}^2]\, \mathbb{E}[h_{km}^{-2}]}} = \frac{\sqrt{E_m}}{\sqrt{K (\varphi + \bar{x}^2)\, \varpi}} \tag{73}$$
$$e_k\big(\alpha_k^\dagger, \{b_{km}^*\}\big) = \mathbb{E}\Bigg[ \frac{ \sum_{m=1}^{M} \zeta_m^* x_{km} + n_k }{ \sum_{m=1}^{M} \zeta_m^* } - \frac{1}{M}\sum_{m=1}^{M} x_{km} \Bigg] = \sum_{m=1}^{M} \Bigg[ \frac{ \sqrt{E_m} }{ \sum_{m'=1}^{M} \sqrt{E_{m'}} } - \frac{1}{M} \Bigg] x_{km}. \tag{74}$$
Then, the norm $\|\mathbb{E}_t[\epsilon_t]\|^2$ in Theorem 1 can be expressed in terms of $e_k$, $\forall k$, as $\|\mathbb{E}_t[\epsilon_t]\|^2 = \sum_{k=1}^{K} \big(\mathbb{E}_t[\epsilon_t]\big)_k^2 = \sum_{k=1}^{K} e_k^2$. Similarly, the variance $\nu_k$ can also be computed as
$$\nu_k\big(\alpha_k^\dagger, \{b_{km}^*\}\big) = \big(\alpha_k^\dagger\big)^2 \sigma^2 = \frac{\sigma^2}{\Big( \sum_{m=1}^{M} \sqrt{ \frac{E_m}{K \varpi (\varphi + \bar{x}^2)} } \Big)^2} = \frac{K \varpi (\varphi + \bar{x}^2)}{\big( \sum_{m=1}^{M} \sqrt{E_m} \big)^2}\, \sigma^2. \tag{75}$$
Finally, the MSE cost in Theorem 1 can be written as $\mathbb{E}_t[\|\epsilon_t\|^2] = \sum_{k=1}^{K} \big( \nu_k + e_k^2 \big)$.
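As a sanity check on the two-stage scheme, the following illustrative simulation (synthetic i.i.d. gradients and channels; the distributions, sizes, and names are our assumptions) superposes the transmitted signals over the air and compares the empirical bias of the practical estimator $\alpha^\dagger$ with the large-$K$ prediction (74):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, sigma = 8, 256, 0.1
E = rng.uniform(0.5, 2.0, size=M)          # per-device power budgets E_m
x = rng.normal(1.0, 0.3, size=(K, M))      # local gradient entries x_km
h = rng.uniform(0.5, 1.5, size=(K, M))     # channel coefficients h_km

# Stage 1: each device m scales by b_km^* = zeta_m^* / h_km, per (66)
zeta = np.sqrt(E / np.sum(x**2 / h**2, axis=0))
b = zeta / h

# Over-the-air superposition on each subchannel, plus receiver noise
y = np.sum(h * b * x, axis=1) + sigma * rng.normal(size=K)

# Stage 2: practical correction alpha^† = 1 / sum_m zeta_m^*
G_hat = y / np.sum(zeta)
G = x.mean(axis=1)                          # true average gradient

# (74) replaces zeta_m^* by its large-K proxy sqrt(E_m)
e_pred = np.sum((np.sqrt(E) / np.sum(np.sqrt(E)) - 1.0 / M) * x, axis=1)
print(np.mean(G_hat - G), np.mean(e_pred))
```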
E. Alternative Formulations and Baselines

A Receiver-Centric Approach.
In what follows, we take a receiver-centric approach by selecting a fixed estimator $\alpha_k = \frac{1}{Mp}$, $\forall k$, at the receiver. Given the fixed estimator, the MSE objective function becomes
$$\mathrm{MSE}\big(1/(Mp), \mathbf{b}\big) = \sum_{k=1}^{K} \sum_{m=1}^{M} \Big( \frac{b_{km} h_{km}}{Mp} - \frac{1}{M} \Big)^2 x_{km}^2 + \sum_{k=1}^{K} \sum_{m=1}^{M} \sum_{\substack{m'=1 \\ m' \ne m}}^{M} \Big( \frac{b_{km} h_{km}}{Mp} - \frac{1}{M} \Big) \Big( \frac{b_{km'} h_{km'}}{Mp} - \frac{1}{M} \Big) x_{km} x_{km'} + \frac{\sigma^2 K}{M^2 p^2}.$$
We note that the first term decomposes across devices and can be minimized in a distributed manner. The second term is coupled across different users, and the third term depends only on the variable $p$. If there were no power constraints, the solution would be $b_{km} h_{km} = p$ for a very large $p$. With this intuition, it is sensible to drive most (if not all) of the multipliers $\big( \frac{b_{km} h_{km}}{Mp} - \frac{1}{M} \big)$ to zero. That is to say, $p$ should be selected to fit most users, providing enough power for the transmission of all dimensions, while every user individually solves its power allocation problem using the first term for the given value of $p$, i.e.,
$$\text{P2-1:} \quad \min_{\{b_{km}\}} \; \sum_{k=1}^{K} \Big( \frac{b_{km} h_{km}}{Mp} - \frac{1}{M} \Big)^2 x_{km}^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} b_{km}^2 x_{km}^2 \le E.$$
The optimal solution $\{b_{km}^*\}$ to the problem (P2-1) can be obtained by finding $\lambda^* \ge 0$ satisfying the following two conditions:
$$\sum_{k=1}^{K} \big(b_{km}^*\big)^2 x_{km}^2 \le E, \qquad b_{km}^* = \frac{x_{km}^2\, p\, h_{km}}{x_{km}^2 h_{km}^2 + M^2 p^2 \lambda_m^*}. \tag{76}$$
When $\lambda^* = 0$, the solution is $b_{km}^* = p / h_{km}$, the objective function is driven to $0$, and the power constraint is satisfied with strict inequality. On the contrary, when $\lambda^* > 0$, the power constraint is satisfied with equality and the optimal solution $b_{km}^*$ deviates from $p / h_{km}$. In that case, the optimal solution $\{b_{km}^*\}$ to the problem (P2-1) can be obtained by finding $\lambda^* > 0$ satisfying the following two equalities:
$$\sum_{k=1}^{K} \big(b_{km}^*\big)^2 x_{km}^2 = E, \qquad b_{km}^* = \frac{x_{km}^2\, p\, h_{km}}{x_{km}^2 h_{km}^2 + M^2 p^2 \lambda_m^*}. \tag{77}$$
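Given $p$, each device can solve (P2-1) by bisection on the monotone map $\lambda \mapsto \sum_k (b_{km}^*)^2 x_{km}^2$ implied by (76)-(77); a minimal sketch (names are ours):

```python
import numpy as np

def receiver_centric_allocation(x_m, h_m, p, M, E, tol=1e-9):
    """Per-device solution of (P2-1) following (76)-(77).

    x_m, h_m: length-K local gradient and channel arrays; p: fixed
    receiver scaling; E: power budget. Returns b_km^*.
    """
    def b_of(lam):
        return x_m**2 * p * h_m / (x_m**2 * h_m**2 + M**2 * p**2 * lam)

    def power(lam):
        return np.sum(b_of(lam)**2 * x_m**2)

    if power(0.0) <= E:          # unconstrained optimum b = p/h is feasible
        return b_of(0.0)
    lo, hi = 0.0, 1.0
    while power(hi) > E:         # power decreases in lambda; bracket the root
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if power(mid) <= E else (mid, hi)
    return b_of(0.5 * (lo + hi))
```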
An equal power allocation approach. For comparison, we also consider the equal power allocation scheme, in which equal power is allocated to each dimension, i.e., we set $b_{km} = b_m$. Enforcing the power constraint of the devices, i.e., $\sum_{k=1}^{K} b_{km}^2 x_{km}^2 = E$, leads to the equal power solution $b_m^* = \sqrt{ E / \sum_{k=1}^{K} x_{km}^2 }$. Each device therefore applies the same scaling $b_m^*$ in every dimension, taking advantage of the data being independent and identically distributed across dimensions.
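For completeness, this baseline is a one-line computation (names are ours):

```python
import numpy as np

def equal_power_allocation(x_m, E):
    """Common scaling b_m^* across all K dimensions, meeting the device
    power constraint sum_k b_m^2 x_km^2 = E with equality."""
    return np.sqrt(E / np.sum(x_m**2))
```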
An Alternative Formulation to P3. When $K$ is large, the channel coefficients $\{h_{km}\}$ and the local gradients $\{x_{km}\}$ are independent and identically distributed (i.i.d.) across all users and subchannels. Hence, the optimization problem (P3) can be further simplified via the law of large numbers as
$$\text{P4:} \quad \min_{\tilde{\mathbf{b}}} \; \mathbb{E}\Big[ \big( \tilde{b}_{km} \tilde{h}_{km} + \tilde{x}_{km} \big)^{-1} \Big] \quad \text{s.t.} \quad \mathbb{E}\big[ \tilde{b}_{km} / \tilde{x}_{km} \big] \le \bar{E}_m, \quad \tilde{b}_{km} \ge 0. \tag{78}$$
We note that the law of large numbers applies only if all $\tilde{b}_{km}$ are identically distributed. By further noting that $\tilde{b}_{km}$ depends only on $\tilde{x}_{km}$ and $\tilde{h}_{km}$, we conclude that the objective terms in (78) are identically distributed for every $k$. Then, (78) simplifies to
$$\text{P4-1:} \quad \min_{\tilde{b}_{km}} \; \mathbb{E}\Big[ \big( \tilde{b}_{km} \tilde{h}_{km} + \tilde{x}_{km} \big)^{-1} \Big] \quad \text{s.t.} \quad \mathbb{E}\Big[ \frac{\tilde{b}_{km}}{\tilde{x}_{km}} \Big] \le \bar{E}_m, \quad \tilde{b}_{km} \ge 0. \tag{79}$$
Subsequently, the KKT conditions are written as

Lagrangian: $\mathcal{L} = \mathbb{E}\Big[ \big( \tilde{b}_{km} \tilde{h}_{km} + \tilde{x}_{km} \big)^{-1} \Big] + \lambda_m \Big( \mathbb{E}\Big[ \frac{\tilde{b}_{km}}{\tilde{x}_{km}} \Big] - \bar{E}_m \Big) - \beta \tilde{b}_{km}$  (80)

Stationarity: $\frac{\partial \mathcal{L}}{\partial \tilde{b}_{km}} = \mathbb{E}\Bigg[ \frac{\lambda_m}{\tilde{x}_{km}} - \frac{\tilde{h}_{km}}{ \big( \tilde{b}_{km} \tilde{h}_{km} + \tilde{x}_{km} \big)^2 } \Bigg] = 0 \;\Leftarrow\; \frac{\lambda_m^*}{\tilde{x}_{km}} = \frac{\tilde{h}_{km}}{ \big( \tilde{b}_{km} \tilde{h}_{km} + \tilde{x}_{km} \big)^2 }$  (81)

Primal feasibility: $\mathbb{E}\big[ \tilde{b}_{km} / \tilde{x}_{km} \big] \le \bar{E}_m, \quad \tilde{b}_{km} \ge 0,$  (82)

Dual feasibility: $\lambda_m \ge 0, \quad \beta \ge 0,$  (83)

Comp. slackness: $\lambda_m \big( \mathbb{E}[ \tilde{b}_{km} / \tilde{x}_{km} ] - \bar{E}_m \big) = 0, \quad \beta \tilde{b}_{km} = 0.$  (84)

Then, the solution is computed as follows:
$$\tilde{b}_{km} = \Bigg[ \sqrt{ \frac{\tilde{x}_{km}}{\lambda_m^* \tilde{h}_{km}} } - \frac{\tilde{x}_{km}}{\tilde{h}_{km}} \Bigg]^+ \;\Longrightarrow\; b_{km} = \sqrt{ \Bigg[ \frac{\sigma}{h_{km} |x_{km}| \sqrt{\lambda_m^*}} - \frac{\sigma^2}{x_{km}^2 h_{km}^2} \Bigg]^+ } \tag{85}$$
$$\lambda_m^* < \frac{\tilde{h}_{km}}{\tilde{x}_{km}} \;\Rightarrow\; \tilde{b}_{km} > 0 \qquad \text{and} \qquad \lambda_m^* < \frac{h_{km}^2 x_{km}^2}{\sigma^2} \;\Rightarrow\; b_{km} > 0 \tag{86}$$
$$\mathbb{E}\Big[ \frac{\tilde{b}_{km}}{\tilde{x}_{km}} \Big] = \int_0^\infty \!\! \int_0^\infty \frac{\tilde{b}_{km}}{\tilde{x}_{km}}\, p(\tilde{h}_{km})\, p(\tilde{x}_{km})\, d\tilde{h}_{km}\, d\tilde{x}_{km} \tag{87}$$
$$= \int_0^\infty \!\! \int_{\lambda_m^* \tilde{x}_{km}}^\infty \Bigg( \frac{1}{ \sqrt{ \lambda_m^* \tilde{h}_{km} \tilde{x}_{km} } } - \frac{1}{\tilde{h}_{km}} \Bigg) p(\tilde{h}_{km})\, p(\tilde{x}_{km})\, d\tilde{h}_{km}\, d\tilde{x}_{km} = \bar{E}_m \tag{88}$$
$$\mathbb{E}\big[ b_{km}^2 x_{km}^2 \big] = \int_{-\infty}^\infty \!\! \int_0^\infty \Bigg[ \frac{\sigma}{h_{km} |x_{km}| \sqrt{\lambda_m^*}} - \frac{\sigma^2}{x_{km}^2 h_{km}^2} \Bigg]^+ x_{km}^2\, p(h_{km})\, p(x_{km})\, dh_{km}\, dx_{km} \tag{89}$$
$$= \int_{-\infty}^\infty \!\! \int_{ \sigma \sqrt{\lambda_m^*} / |x_{km}| }^\infty \Bigg( \frac{\sigma |x_{km}|}{h_{km} \sqrt{\lambda_m^*}} - \frac{\sigma^2}{h_{km}^2} \Bigg) p(h_{km})\, p(x_{km})\, dh_{km}\, dx_{km} = \bar{E}_m. \tag{90}$$
By using the results above, the bias term $e_k$ is derived as
$$e_k\big(\alpha_k^*, \{b_{km}\}\big) = \mathbb{E}\big[\alpha_k^* y_k\big] - \frac{1}{M}\sum_{m=1}^{M} x_{km} = \mathbb{E}\Bigg[ \sum_{m=1}^{M} \Big( \alpha_k^* b_{km} h_{km} - \frac{1}{M} \Big) x_{km} + \alpha_k^* n_k \Bigg] \tag{91}$$
$$= \mathbb{E}\Bigg[ \sum_{m=1}^{M} \frac{1}{M} \Bigg( h_{km} \sqrt{ \Bigg[ \frac{\sigma}{h_{km} |x_{km}| \sqrt{\lambda_m^*}} - \frac{\sigma^2}{x_{km}^2 h_{km}^2} \Bigg]^+ } - 1 \Bigg) x_{km} + \frac{n_k}{M} \Bigg] \tag{92}$$
$$= \frac{1}{M} \sum_{m \in \mathcal{S}^\dagger} \Bigg( \sqrt{ \frac{\sigma h_{km}}{|x_{km}| \sqrt{\lambda_m^*}} - \frac{\sigma^2}{x_{km}^2} } - 1 \Bigg) x_{km}, \tag{93}$$
where $\alpha_k^* = 1/M$, the expectation is taken with respect to $n_k$, and $\mathcal{S}^\dagger$ represents the set of transmitters for which (86) is satisfied. To compute the MSE cost, we compute the variance $\nu_k(\alpha_k^*, \{b_{km}\})$, again with the expectation taken with respect to $n_k$ for given realizations of $x_{km}$:
$$\nu_k\big(\alpha_k^*, \{b_{km}\}\big) = \mathbb{E}\Bigg( \sum_{m=1}^{M} \alpha_k^* b_{km} h_{km} x_{km} + \alpha_k^* n_k - \mathbb{E}\Bigg[ \sum_{m=1}^{M} \alpha_k^* b_{km} h_{km} x_{km} \Bigg] \Bigg)^2 \tag{94}$$
$$= \mathbb{E}\Big[ \big( \alpha_k^* n_k \big)^2 \Big] = \frac{\sigma^2}{M^2}. \tag{95}$$
Consequently, the cost $\mathbb{E}_t[\|\epsilon_t\|^2]$ in Theorem 1 is computed as $\mathbb{E}_t[\|\epsilon_t\|^2] = \sum_{k=1}^{K} \big( e_k^2 + \nu_k \big)$.
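Since (88)-(90) are expectations over the (assumed known) distributions of the channels and gradients, $\lambda_m^*$ can be computed offline by replacing the integrals with sample averages; a minimal Monte Carlo sketch (the sampling interface and names are ours):

```python
import numpy as np

def stochastic_waterfill(h_samples, x_samples, sigma, E_bar, tol=1e-6):
    """Monte Carlo solution of (90): find lambda_m^* such that the
    expected transmit power E[b_km^2 x_km^2] equals the budget E_bar.

    h_samples, x_samples: i.i.d. draws of h_km and x_km standing in for
    the assumed distributions; sigma: noise standard deviation.
    """
    def expected_power(lam):
        b2 = np.maximum(sigma / (h_samples * np.abs(x_samples) * np.sqrt(lam))
                        - sigma**2 / (x_samples**2 * h_samples**2), 0.0)
        return np.mean(b2 * x_samples**2)   # sample average of b_km^2 x_km^2

    # expected power is decreasing in lambda: bracket the budget, then bisect
    lo, hi = 1e-12, 1.0
    while expected_power(hi) > E_bar:
        hi *= 2.0
    while hi / lo > 1.0 + tol:
        mid = np.sqrt(lo * hi)              # geometric bisection for scale
        lo, hi = (lo, mid) if expected_power(mid) <= E_bar else (mid, hi)
    return np.sqrt(lo * hi)
```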