Federated Learning over Noisy Channels: Convergence Analysis and Design Examples
Xizixiang Wei and Cong Shen
Abstract
Does Federated Learning (FL) work when both uplink and downlink communications have errors? How much communication noise can FL handle, and what is its impact on the learning performance? This work is devoted to answering these practically important questions by explicitly incorporating both uplink and downlink noisy channels in the FL pipeline. We present several novel convergence analyses of FL over simultaneous uplink and downlink noisy communication channels, which encompass full and partial clients participation, direct model and model differential transmissions, and non-independent and identically distributed (IID) local datasets. These analyses characterize the sufficient conditions for FL over noisy channels to have the same convergence behavior as the ideal case of no communication error. More specifically, in order to maintain the O(1/T) convergence rate of FedAvg with perfect communications, the uplink and downlink signal-to-noise ratios (SNR) for direct model transmissions should be controlled such that they scale as O(t²), where t is the index of communication rounds, but can stay constant for model differential transmissions. The key insight of these theoretical results is a "flying under the radar" principle: stochastic gradient descent (SGD) is an inherently noisy process, and uplink/downlink communication noises can be tolerated as long as they do not dominate the time-varying SGD noise. We exemplify these theoretical findings with two widely adopted communication techniques, transmit power control and diversity combining, and further validate their performance advantages over the standard methods via extensive numerical experiments using several real-world FL tasks.
I. INTRODUCTION
The authors are with the Charles L. Brown Department of Electrical and Computer Engineering, University of Virginia, Charlottesville, VA 22904, USA. E-mail: {xw8cw,cong}@virginia.edu. The work was partially supported by the US National Science Foundation (NSF) under Grant ECCS-2033671.

Federated learning (FL) [1], [2] is an emerging distributed machine learning paradigm with many attractive properties that can address new challenges in machine learning (ML). In particular, FL is motivated by the growing trend that massive amounts of real-world data are exogenously generated at the edge devices. For better privacy protection, it is desirable to keep the data locally at the device and enable distributed model training, which has motivated the development of FL. The power of FL has been realized in commercial devices (e.g., Google Pixel 2 uses FL to train ML models to personalize user experience) and ML tasks (e.g., Gboard uses FL for keyboard prediction) [3].
Communication efficiency has been at the front and center of FL ever since its inception [1], [2], and it is widely regarded as one of its primary bottlenecks [4], [5]. Early research has largely focused on either reducing the number of communication rounds [1], [6], or decreasing the size of the payload for transmission [5], [7]–[9]. This is because most of the FL literature that deals with communication efficiency assumes that a perfect communication "tunnel" has been established, so the task of improving communication efficiency largely resides in the ML design that trades off computation and communication. More recent research starts to fill this gap by focusing on the communication system design, particularly for wireless FL [10], [11] where the underlying communication is unreliable; see Section II for an overview.
Nevertheless, the focus has been on resource allocation [12]–[14], device selection [15]–[18], or either uplink or downlink (but not both) cellular system design [13], [19].
While these early studies provide a glimpse of the potential of optimizing the communication design for FL, the important issue of noisy communications on both the uplink (clients send local models to the parameter server) and the downlink (server sends the global model to clients) has not been well investigated. In particular, it is often taken for granted that standard communication techniques can be directly applied to FL. We show in this paper that this can be highly suboptimal, because those techniques are mostly designed for independent and identically distributed (IID) sources (over time), while the communicated local and global models in both communication directions of FL represent a long-term process consisting of many progressive rounds of model updating that collectively determine the final learning outcome. Channel noise or bit/packet error rates cannot be directly translated to the ultimate ML model accuracy and convergence rate. It is thus important to rethink the communication designs in both uplink and downlink so that they cater to the unique characteristics of FL.
The goal of this paper is to answer the following questions: how much communication noise can FL handle, and what is its impact on the learning performance? Towards this end, we first describe a complete FL system where both model upload and download take place over noisy channels. We then present the first major contribution of this work: novel convergence analyses of the standard FedAvg scheme under non-IID datasets, full or partial clients participation, direct model or model differential transmissions, and noisy downlink and uplink channels. These analyses are based on very general noise assumptions, and hence are broadly applicable to a variety of practical communication systems.
More importantly, the analyses reveal that, in order to maintain the same O(1/T) convergence rate of FedAvg with perfect communications (the ideal case), the uplink and downlink signal-to-noise ratios (SNR) should be controlled such that they scale as O(t²), where t is the index of communication rounds, when the updated local or global model is directly transmitted over the noisy channel. Furthermore, this holds for both full and partial clients participation, the latter of which is significantly more difficult to analyze. On the other hand, when only the model differential is transmitted on the uplink, we show that a constant SNR is sufficient to achieve the O(1/T) convergence rate, despite the fact that both uplink and downlink noise instances exist in the aggregated global model. The key insight of these theoretical results is a "flying under the radar" principle: stochastic gradient descent (SGD) is inherently a noisy process, and as long as the uplink/downlink channel noises do not dominate the SGD noise during model training, which is controlled by the time-varying learning rate, the convergence behavior is not affected. This general principle is exemplified with two widely adopted communication techniques, transmit power control and diversity combining, by controlling the resulting SNR to satisfy the theoretical analyses. Comprehensive numerical evaluations on three widely adopted ML tasks with increasing difficulty, MNIST, CIFAR-10, and
Shakespeare, are carried out using these techniques. We design a series of experiments to demonstrate that the fine-tuned transmit power control and diversity combining that are guided by the theoretical analyses can significantly outperform the equal-SNR-over-time baseline, and in fact can approach the ideal noise-free communication performance in many of the experiment settings. These results not only corroborate the theoretical conclusions but also highlight the benefit of designing communication algorithms that are specifically tailored to the characteristics of FL.
The remainder of this paper is organized as follows. Related works are surveyed in Section II. The system model that captures the noisy channels in both uplink and downlink of FL is described in Section III. Theoretical analyses are presented in Section IV for three different FL configurations. These results inspire novel communication designs, for which two examples of transmit power control and diversity combining are presented in Section V. Experimental results are given in Section VI, followed by the conclusions in Section VII. All technical proofs are given in the Appendices.
II. RELATED WORKS
Federated Learning [1] focuses on many clients collaboratively training an ML model under the coordination of a central server while keeping the local training data private at each user. It has been extensively studied in recent years in the machine learning and artificial intelligence community, aiming to address various questions around improving machine learning efficiency and effectiveness [20]–[23], preserving the privacy of user data [24], [25], robustness to attacks and failures [26], and ensuring fairness and addressing sources of bias [3], [27].
Among the various challenges associated with FL, improving the communication efficiency has attracted a lot of interest [4], [5]. One of the representative approaches for reducing the communication rounds is FedAvg [1], which allows periodic model aggregation and local model updates and thus enables a flexible communication-computation tradeoff [6]. Theoretical understanding of this tradeoff has been actively pursued and, depending on the underlying assumptions (e.g., IID or non-IID local datasets, convex or non-convex loss functions, gradient descent or stochastic gradient descent), rigorous analysis of the convergence behavior has been carried out [20], [21], [23], [28]. For the approach of reducing the size of the exchanged messages in each round, general discussions on sparsification, subsampling, and quantization are given in [2]. There have been recent efforts in developing quantization and source coding to reduce the communication cost [7]–[9], [19], [29]–[31]. Nevertheless, they mostly focus on either uplink or downlink but not both, and largely consider only full clients participation.
Recent years have also seen increased effort in the communication algorithm, protocol, and system design of FL. The inherent trade-off between local model update and global model aggregation is studied in [32] to optimize over transmission power/rate and training time. Various radio resource allocation and client selection policies [12], [16]–[18], [33]–[35] have been proposed to minimize the learning loss or the training time. Joint communication and computation design is investigated in [13], [15], [30], [36], [37]. In particular, the analog aggregation design [13], [30], [37], [38] serves as one of our design examples in Section V.
The effect of noise has been studied in [39], [40], but the emphasis there has been on how to modify the machine learning methods to reduce noise, while our work keeps the model training aspect of FL unchanged and focuses on the communication design that controls the effective noise.
III. SYSTEM MODEL
We first introduce the standard FL problem formulation, and then describe a complete FL pipeline where both local model upload and global model download take place over noisy channels.
A. FL Problem Formulation
The general federated learning problem setting studied in this paper mostly follows the standard model in the original paper [1]. In particular, we consider an FL system with one central parameter server (e.g., base station) and a set of at most N clients (e.g., mobile devices). Client n ∈ [N] ≜ {1, 2, ..., N} stores a local dataset D_n = {z_i}_{i=1}^{D_n}, with its size denoted by D_n, that never leaves the client. Datasets across clients are assumed to be non-IID and disjoint. The maximum data size when all clients participate in FL is D = Σ_{n=1}^N D_n. The loss function f(w, z) measures how well an ML model with parameter w ∈ R^d fits a single data sample z. Without loss of generality, we assume that w has zero-mean and unit-variance elements, i.e., E||w_i||² = 1, ∀i ∈ [d]. For the n-th client, its local loss function F_n(·) is defined by

F_n(w) ≜ (1/D_n) Σ_{z∈D_n} f(w, z).

The goal of FL is for the parameter server to learn a global machine learning (ML) model based on the distributed local datasets at the N clients, by coordinating and aggregating the training processes at individual clients without accessing the raw data. Specifically, the global optimization objective over all N clients is given by

F(w) ≜ Σ_{n=1}^N (D_n/D) F_n(w) = (1/D) Σ_{n=1}^N Σ_{z∈D_n} f(w, z). (1)

The global loss function measures how well the model fits the entire corpus of data on average. The learning objective is to find the best model parameter w* that minimizes the global loss function: w* = arg min_w F(w). Let F* and F*_k be the minimum values of F and F_k, respectively. Then, Γ = F* − (1/N) Σ_{k=1}^N F*_k quantifies the degree of non-IID-ness, as shown in [23].
B. FL over Noisy Uplink and Downlink Channels
We study a generic FL framework where partial clients participation and non-IID local datasets, two critical features that separate FL from conventional distributed ML, are explicitly captured. Unlike the existing literature, we focus on imperfect communications and consider that both the upload and download transmissions take place over noisy communication channels. The overall system diagram is depicted in Fig. 1. In particular, the FL-over-noisy-channel pipeline works by iteratively executing the following steps at the t-th learning round, ∀t ∈ [T].

(1) Downlink communication for global model download. The centralized server broadcasts the current global ML model, which is described by the latest weight vector w_{t−1} from the previous round, to a set of uniformly randomly selected clients denoted as S_t with |S_t| = K. (For partial clients participation we have K < N; in the case of full clients participation we have K = N.) Because of the imperfection introduced in communications, client k receives a noisy version of w_{t−1}, which is written as

ŵ^k_{t−1} = w_{t−1} + e^k_t, (2)

where e^k_t = [e^k_{t,1}, ..., e^k_{t,d}]^T ∈ R^d is the d-dimensional downlink effective noise vector at client k and time t. (The parameter normalization and de-normalization method can be found in the appendix of [13].)

Fig. 1. End-to-end FL system diagram in the t-th communication round. The impact of noisy channels in both uplink and downlink is captured.

We assume that e^k_t is a zero-mean random vector consisting of IID elements with variance

E||e^k_{t,i}||² = ζ_{t,k} and E||e^k_t||² = dζ_{t,k}, ∀t ∈ [T], k ∈ S_t, i ∈ [d]. (3)

Note that the effective noise term does not necessarily correspond to only the channel noise; it also captures post-processing errors in a communication transceiver, such as estimation, decoding and demodulation errors, frequency offset and phase noise, etc. We further note that the noise assumption is very mild: Eqn. (3) only requires a bounded variance of the random noise, but does not limit it to any particular distribution. We define the local (post-processing) receive SNR for the k-th client at the t-th communication round as

SNR^L_{t,k} = E||w_{t−1}||² / E||e^k_t||² = 1/ζ_{t,k},

where we assume without loss of generality that the transmit power of each scalar model parameter is normalized: E||w_{t−1,i}||² = 1, ∀i ∈ [d]. Note that this corresponds to the normalized weight described in Section III-A. In addition, we remark that the downlink communication model is very general in the sense that the effective noise variances {ζ_{t,k}} can be different for different clients and at different rounds.

(2) Local computation. Each client uses its local data to train a local ML model improved upon the received global ML model. In this work, we assume that mini-batch stochastic gradient descent (SGD) is used in the model training. Note that this is the most commonly adopted training method in modern ML tasks, e.g., deep neural networks, but its analysis is much more complicated than that of gradient descent (GD) when communication noise is present.
Specifically, mini-batch SGD operates by updating the weight iteratively (for E steps in each learning round) at client k as follows:

Initialization: w^k_{t,0} = ŵ^k_{t−1},
Iteration: w^k_{t,τ} = w^k_{t,τ−1} − η_t ∇F_k(w^k_{t,τ−1}, ξ^k_τ), ∀τ = 1, ..., E,
Output: w^k_t = w^k_{t,E},

where ξ^k_τ is a batch of data points sampled independently and uniformly at random from the local dataset of client k in the τ-th iteration of mini-batch SGD.

(3) Uplink communication for local model upload. The K participating clients upload their latest local models to the server. More specifically, client k transmits a vector x^k_t to the server at the t-th round. We again consider the practical case where the upload communication is erroneous, and the server receives a noisy version of the individual weight vectors from each client due to various imperfections in the uplink communications (e.g., channel noise, transmitter and receiver distortion, processing error). The received vector for client k can be written as

x̂^k_t = x^k_t + n^k_t, (4)

where n^k_t ∈ R^d is the d-dimensional effective uplink noise vector for decoding client k's model at time t. We assume that n^k_t is a zero-mean random vector consisting of IID elements with bounded variance:

E||n^k_{t,i}||² = σ_{t,k} and E||n^k_t||² = dσ_{t,k}, ∀t ∈ [T], k ∈ S_t, i ∈ [d]. (5)

We again note that the uplink communication model in Eqn. (5) is very general in the sense that (1) only a bounded variance is assumed, as opposed to a specific noise distribution; and (2) the effective noise variances {σ_{t,k}} can be different for different clients and at different rounds.
Two different choices of the vector x^k_t for model upload are considered in this paper.
1) Model Transmission (MT).
The K participating clients upload the latest local models w^k_t: x^k_t = w^k_t. Following (4), the server receives the updated local model of client k as

w̃^k_t = x̂^k_t = w^k_t + n^k_t. (6)

2) Model Differential Transmission (MDT).
The K participating clients only upload the differences between the latest local model and the previously received (noisy) global model, i.e., x^k_t = d^k_t ≜ w^k_t − ŵ^k_{t−1}. For MDT, the server uses d^k_t and the previously computed global model w_{t−1} to reconstruct the updated local model of client k as

w̃^k_t = w_{t−1} + x̂^k_t = w_{t−1} + d^k_t + n^k_t = w^k_t + n^k_t − e^k_t. (7)

The SNR for these two schemes, however, has to be defined slightly differently, because we have normalized the ML model parameter w to have unit-variance elements in Section III-A. Thus, for MT, we can write the receive SNR at the server for the k-th client's signal as

SNR^{S,MT}_{t,k} = E||w^k_t||² / E||n^k_t||² = 1/σ_{t,k}. (8)

For MDT, we keep the SNR expression general, since the variance of the model difference d^k_t is unknown a priori and also changes over time. We have

SNR^{S,MDT}_{t,k} = E||d^k_t||² / E||n^k_t||² = E||d^k_t||² / (dσ_{t,k}). (9)

Remark 1.
Both schemes can be useful in practice, depending on the specific requirements of the FL system. For example, transmitting the weight differential relies on the server keeping the previous global model w_{t−1}, so that the new local models can be reconstructed. This, however, may not always be true if the server deletes intermediate model aggregations for privacy preservation [3], which makes reconstruction from the model differential infeasible.
Remark 2.
It is worth noting that the different choices of what to transmit are not considered in most of the existing literature because, without considering the communication error, there is no difference between them from a pure learning perspective: as long as the server can reconstruct w^k_t, this aspect does not impact the learning performance [1]. However, as we will see shortly, the choice becomes significant when communication errors are present.
Remark 3.
Both uplink and downlink channel noises collectively impact the received local models at the server. This may be more obvious for MDT than for MT (Eqn. (7) explicitly involves both noise terms). However, we note that the latest local model is trained using the previously received global model, which contains the downlink noise. As a result, both MT and MDT are impacted by both uplink and downlink channel noises.
(4) Global aggregation.
The server aggregates the received local models to generate a new global ML model, following the standard FedAvg [1]:

w_t = Σ_{k∈S_t} (D_k / Σ_{i∈S_t} D_i) w̃^k_t.

The server then moves on to the (t+1)-th round. For ease of exposition and to simplify the analysis, we assume in the remainder of the paper that the local dataset sizes at all clients are the same: D_i = D_j, ∀i, j ∈ [N], which leads to the following simplifications.
1) MT.
The aggregation can be simplified as

w_t = (1/K) Σ_{k∈S_t} w̃^k_t = (1/K) Σ_{k∈S_t} x̂^k_t = (1/K) Σ_{k∈S_t} (w^k_t + n^k_t). (10)

2) MDT.
The aggregation can be written as

w_t = (1/K) Σ_{k∈S_t} w̃^k_t = w_{t−1} + (1/K) Σ_{k∈S_t} x̂^k_t = (1/K) Σ_{k∈S_t} (w^k_t + n^k_t − e^k_t). (11)

For the case of MT, the SNR for the global model (after aggregation) can be written as

SNR^G_t = E||Σ_{k∈S_t} w^k_t||² / E||Σ_{k∈S_t} n^k_t||² = E||Σ_{k∈S_t} w^k_t||² / (dσ_t), (12)

and for MDT, the SNR for the global model can be written as

SNR^G_t = E||Σ_{k∈S_t} w^k_t||² / E||Σ_{k∈S_t} (n^k_t − e^k_t)||² = E||Σ_{k∈S_t} w^k_t||² / (d(σ_t + ζ_t)), (13)

where σ_t ≜ Σ_{k∈S_t} σ_{t,k} and ζ_t ≜ Σ_{k∈S_t} ζ_{t,k} denote the total uplink and downlink effective noise powers for the participating clients, respectively.
Remark 4.
We note that the total signal power (the numerators in Eqns. (12) and (13)) is difficult to evaluate. In general, {w^k_t} are correlated across clients because the local model updates all start from (roughly) the same global model. Intuitively, once FL converges, these models will largely be the same, leading to a signal power term of dK² for the numerator. On the other hand, if we assume that these local models are independent across clients, which is reasonable in the early phases of FL with large local epochs, where the (roughly) same starting point has a diminishing impact due to the long training period and the non-IID nature of the data distribution, we have a signal power term of dK. Nevertheless, since the SNR control can be realized by adjusting the effective noise power levels, we focus on the impact of σ_t and ζ_t on the FL performance in Section IV.
We emphasize that all the results of this paper can be extended to handle different local dataset sizes.
IV. CONVERGENCE ANALYSIS OF FL OVER NOISY CHANNELS
A. Convergence Analysis for Model Transmission for Full Clients Participation
We first analyze the convergence of FedAvg in the presence of both uplink and downlink communication noise when direct model transmission (MT) is adopted for the local model upload: x^k_t = w^k_t. To simplify the analysis and highlight the key techniques in deriving the convergence rate, we assume K = N in this subsection (i.e., full clients participation with S_t = [K] = [N]), and leave the case of partial clients participation to Section IV-B.
We make the following standard assumptions that are commonly adopted in the convergence analysis of FedAvg and its variants; see [7], [8], [20], [22], [23], [41]. In particular, Assumption 1-2) indicates that we focus on strongly convex F_k(·), which represents a category of loss functions that are widely studied in the literature.
Assumption 1. 1) L-smooth: ∀v and w, F_k(v) ≤ F_k(w) + (v − w)^T ∇F_k(w) + (L/2)||v − w||².
2) µ-strongly convex: ∀v and w, F_k(v) ≥ F_k(w) + (v − w)^T ∇F_k(w) + (µ/2)||v − w||².
3) Bounded variance for unbiased mini-batch SGD:
The mini-batch SGD is unbiased: E[∇F_k(w, ξ)] = ∇F_k(w), and the variance of the stochastic gradients is bounded: E||∇F_k(w, ξ) − ∇F_k(w)||² ≤ δ²_k, for mini-batch data ξ at client k ∈ [N].
4) Uniformly bounded gradient: E||∇F_k(w, ξ)||² ≤ H² for mini-batch data ξ at client k ∈ [N].
We present the main convergence result of MT with full clients participation in Theorem 1, whose proof can be found in Appendix A.
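For intuition, the constants in Assumption 1 can be computed in closed form for a least-squares loss, whose Hessian is constant. The sketch below (synthetic data, purely illustrative and not from the paper) takes L and µ as the extreme eigenvalues of the Hessian and checks both inequalities numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 50, 5
A = rng.normal(size=(m, d))
b = rng.normal(size=m)

def F(w):
    # Least-squares loss F(w) = (1/2m) ||A w - b||^2
    return 0.5 * np.sum((A @ w - b) ** 2) / m

def grad_F(w):
    return A.T @ (A @ w - b) / m

# The Hessian is A^T A / m, so the smoothness and strong-convexity
# constants of Assumption 1 are its largest and smallest eigenvalues.
eigs = np.linalg.eigvalsh(A.T @ A / m)
mu, L = eigs[0], eigs[-1]

# Check the two inequalities of Assumption 1 at random point pairs;
# for a quadratic loss they hold with the eigenvalue constants exactly.
for _ in range(100):
    v, w = rng.normal(size=d), rng.normal(size=d)
    lin = F(w) + (v - w) @ grad_F(w)
    gap = np.sum((v - w) ** 2)
    assert F(v) <= lin + 0.5 * L * gap + 1e-9   # L-smoothness
    assert F(v) >= lin + 0.5 * mu * gap - 1e-9  # mu-strong convexity
```

Here κ = L/µ is the condition number used in the theorems below.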
Theorem 1.
Define κ = L/µ and γ = max{8κ, E}. Set the learning rate as η_t = 2/(µ(γ + t)) and adopt a SNR control policy that scales the effective uplink and downlink noise powers over t such that

σ_t ≤ 4N²/(µ²(γ + t − 1)²) ∼ O(1/t²), (14)
ζ_t ≤ 4N²/(µ²(γ + t)(γ + t − 1)) ∼ O(1/t²), (15)

where σ_t ≜ Σ_{k∈[N]} σ_{t,k} and ζ_t ≜ Σ_{k∈[N]} ζ_{t,k} denote the total uplink and downlink effective noise powers, respectively. Then, under Assumption 1, the convergence of FedAvg with non-IID datasets and
There existsignal processing and communication techniques that can satisfy the requirement by either increasing thesignal power (e.g., transmit power control) or reducing the post-processing noise power (e.g., diversitycombining). We discuss design examples that realize the requirement of Theorem 1 in Section V. B. Convergence Analysis for Model Transmission for Partial Clients Participation
We now generalize the convergence analysis from full clients participation to partial clients participation, where we have a given K < N and uniformly randomly select a set of clients S_t at round t to carry out the FL process. In this section, we mostly follow the FL system model described in Section III-B, with the only simplification that we consider homogeneous noise power levels on the uplink and downlink, i.e., we assume

σ_{t,k} = σ̄_t and ζ_{t,k} = ζ̄_t, ∀t ∈ [T], k ∈ [N]. (17)

The main reason to introduce this simplification is the time-varying, randomly participating clients: since S_t changes over t, the total power levels also vary over t if we insist on heterogeneous noise powers for different clients. Furthermore, since clients are randomly selected, the total power level becomes a random variable as well, which significantly complicates the convergence analysis. Making this assumption allows us to focus on the challenge with respect to the model update from partial clients participation.
Theorem 2.
Let κ, γ, and η_t be the same as in Theorem 1. Adopt a SNR control policy that scales the effective uplink and downlink noise powers over t such that

σ̄_t ≤ 4K/(µ²(γ + t − 1)²) ∼ O(1/t²), (18)
ζ̄_t ≤ 4N/(µ²(γ + t)(γ + t − 1)) ∼ O(1/t²), (19)

where σ̄_t and ζ̄_t represent the individual-client effective noise powers in the uplink and downlink, respectively, as defined in Eqn. (17). Then, under Assumption 1, the convergence of FedAvg with non-IID datasets and partial clients participation has the same convergence rate expression as Eqn. (16), with D replaced by

D = Σ_{k=1}^N δ²_k/N² + 6LΓ + 8(E − 1)²H² + ((N − K)/(N − 1)) (4/K) E²H² + 2d.

The proof of Theorem 2 is given in Appendix B. We can see that partial clients participation does not fundamentally change the behavior of FL in the presence of uplink and downlink communication noises, and the requirement for SNR control remains largely the same.
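The schedules in Theorems 1 and 2 can be sketched in a few lines. The constants below follow the bounds as reconstructed above, and the parameter values are illustrative; the point is the "flying under the radar" relation between the noise caps and the learning rate:

```python
import numpy as np

def schedules(t, N, K, mu, L, E):
    """Round-t learning rate and admissible noise powers, following the
    (reconstructed) conditions of Theorems 1 and 2; illustrative only."""
    kappa = L / mu
    gamma = max(8 * kappa, E)
    eta = lambda s: 2.0 / (mu * (gamma + s))              # O(1/t) learning rate
    sigma_full = 4 * N**2 / (mu**2 * (gamma + t - 1)**2)  # Thm 1 uplink cap
    zeta_full = 4 * N**2 / (mu**2 * (gamma + t) * (gamma + t - 1))  # Thm 1 downlink cap
    sigma_bar = 4 * K / (mu**2 * (gamma + t - 1)**2)      # Thm 2 per-client uplink cap
    return eta(t), eta(t - 1), sigma_full, zeta_full, sigma_bar

# The uplink cap equals N^2 * eta_{t-1}^2, i.e., the admissible channel
# noise decays at exactly the O(1/t^2) rate of the SGD noise term.
N, K, mu, L, E = 10, 5, 0.5, 5.0, 4
for t in range(1, 50):
    eta_t, eta_prev, sigma, zeta, sigma_bar = schedules(t, N, K, mu, L, E)
    assert np.isclose(sigma, N**2 * eta_prev**2)
    assert zeta <= sigma  # the downlink cap is the tighter of the two
```

In a deployment, a transmitter would invert these caps into a receive-SNR target for round t, which is what the design examples in Section V do.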
C. Convergence Analysis for Model Differential Transmission
In this section, we consider the model differential transmission (MDT) scheme for the clients to upload model parameters. Since only the model differential is transmitted, the receiver must possess a copy of the "base" model to reconstruct the updated model. This precludes using MDT in the downlink under partial clients participation, because the participating clients differ from round to round, and a newly participating client does not have the "base" model of the previous round with which to reconstruct the new global model. We thus only focus on MDT in the uplink and MT in the downlink with partial clients participation.
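Before stating the result, the uplink-MDT bookkeeping of Eqn. (7) is easy to verify numerically. In the sketch below the "local training" line is a stand-in for E epochs of SGD (the dimensions and noise levels are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
w_prev = rng.normal(size=d)        # global model w_{t-1}, kept at the server
e_k = 0.05 * rng.normal(size=d)    # downlink effective noise e_t^k
n_k = 0.05 * rng.normal(size=d)    # uplink effective noise n_t^k

w_hat = w_prev + e_k               # noisy downlink copy, Eqn. (2)
w_local = 0.9 * w_hat + 0.1        # stand-in for E local SGD epochs
d_k = w_local - w_hat              # MDT payload d_t^k = w_t^k - w_hat
x_recv = d_k + n_k                 # noisy uplink reception, Eqn. (4)

w_tilde = w_prev + x_recv          # server reconstruction, Eqn. (7)
# Both channel noises survive in the reconstruction: w_t^k + n_t^k - e_t^k.
assert np.allclose(w_tilde, w_local + n_k - e_k)
```

As FL converges, `d_k` shrinks toward zero, which is why a constant uplink SNR on the differential translates into a vanishing effective noise power (Theorem 3 below).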
Theorem 3.
Let κ, γ, and η_t be the same as in Theorem 1, and let the effective noise follow Eqn. (17). Adopt a SNR control policy that maintains a constant uplink SNR at each client over t:

SNR^{S,MDT}_{t,k} = ν ∼ O(1), (20)

and scales the effective downlink noise power at each client over t such that

ζ̄_t ≤ 4N/(µ²(γ + t)(γ + t − 2)) + 4/(Kν(γ + t)) ∼ O(1/t). (21)

Then, under Assumption 1, the convergence of FedAvg with non-IID datasets and partial clients participation, for uplink MDT and downlink MT, has the same convergence rate expression as Eqn. (16), with D replaced by

D = Σ_{k=1}^N δ²_k/N² + 6LΓ + 8(E − 1)²H² + ((N − K)/(N − 1)) (4/K) E²H² + (4E²H²)/(Kν) + d.

The complete proof of Theorem 3 can be found in Appendix C. It is instrumental to note that, unlike direct model transmission, transmitting only the model differentials in the uplink allows us to remove the corresponding SNR scaling requirement. Instead, one can keep a constant
SNR in the uplink throughout the entire FL process. Intuitively, this is because the "scaling" already takes place in the model differential d^k_t, which is the difference between the updated local model at client k after E epochs of training and the starting local model. As FL gradually converges, this differential becomes smaller. Thus, by keeping a constant communication SNR, we essentially scale down the effective noise power at the server.
V. COMMUNICATION DESIGN EXAMPLES FOR FL IN NOISY CHANNELS
A. Design Example I: Transmit Power Control for Analog Aggregation
An immediate engineering question following the theoretical analysis is how to realize the required effective noise powers (either O(1/t²) or constant) in the theorems. One natural approach is transmit power control, which has the flexibility of controlling the average receive SNR (and thus the effective noise power) while satisfying the power constraint. In this work, we design a power control policy for the analog aggregation FL framework in [13], as an example to demonstrate the system design for FL tasks in the presence of communication noise.
The analog aggregation method in [13].
Consider a communication system where several narrowband orthogonal channels (e.g., sub-carriers in OFDM, time slots in TDMA, or eigenchannels in MIMO) are shared by K randomly selected clients in the uplink model update phase of a communication round in FedAvg. Each element of the transmitted model w ∈ R^d is allocated to and transmitted in a narrowband channel and aggregated automatically over the air [13]. Denote the received signal of each element i = 1, ..., d in the t-th communication round as

y_{t,i} = (1/K) Σ_{k∈S_t} r^{−α/2}_{t,k} h_{t,k,i} p_{t,k,i} w_{t,k,i} + n_{t,i},

where r^{−α/2}_{t,k} and h_{t,k,i} ∼ CN(0, 1) denote the large-scale and small-scale fading coefficients of the channel, respectively, n_{t,i} ∼ CN(0, 1) is the IID additive white Gaussian noise, and p_{t,k,i} denotes the transmit power determined by the power control policy. We assume perfect channel state information at the transmitters (CSIT). Due to the aggregation requirement of federated learning, the channel inversion rule is used in [13], which leads to the following instantaneous transmit power of user k at time t for model weight element i:

p_{t,k,i} = √(ρ^UL_t) / (r^{−α/2}_{t,k} h_{t,k,i}), (22)

where ρ^UL_t is a scalar that denotes the uplink average transmit power. Hence, the receive SNR of the global model can be written as

SNR^G_t = E||(√(ρ^UL_t)/K) Σ_{k∈S_t} w^k_t||² / E||n_t||² = ρ^UL_t E||Σ_{k∈S_t} w^k_t||² / (dK²). (23)

Transmit power control.
The original analog aggregation framework in [13] assumes that $\rho_t^{UL}$ is constant over time $t$. However, our theoretical analysis in Section IV suggests that this can be improved. Specifically, taking partial clients participation and MT as an example, and further assuming IID weight elements, plugging in Theorem 2 yields
$$ \rho_t^{UL} = \frac{K}{\sigma_t^2} \geq \frac{\mu^2 (\gamma + t - 1)^2}{4} \sim O(t^2), \qquad (24) $$
which implies that $\rho_t^{UL}$ should be increased at the rate $O(t^2)$ in the uplink to ensure the convergence of FedAvg. Similar policies can be derived for MDT and/or full clients participation by invoking the corresponding theorems.

In the downlink case, when the server broadcasts the global model to $K$ randomly selected clients, the received signal of the $i$-th element at the $n$-th user in the $t$-th communication round is
$$ y_{t,n,i} = r_{t,n}^{-\alpha/2} h_{t,n,i} \sqrt{\rho_t^{DL}}\, w_{t,i} + e_{t,n,i}, \qquad \forall n = 1, \cdots, K, $$
where $e_{t,n,i} \sim \mathcal{CN}(0,1)$ is the IID additive white Gaussian noise and $\rho_t^{DL}$ is the transmit power at the server. The downlink SNR for the $n$-th user is
$$ \mathrm{SNR}_{t,n}^{L} = r_{t,n}^{-\alpha}\, \rho_t^{DL}. \qquad (25) $$
Instead of keeping $\rho_t^{DL}$ constant, we derive the following policy based on Theorem 3 to guarantee the convergence of FedAvg:
$$ \rho_t^{DL} \geq \frac{r_{t,n}^{\alpha}\, \mu^2 (\gamma + t)(\gamma + t - 2)}{4N} \sim O(t^2). \qquad (26) $$
Finally, by applying the power control policies defined in Eqns. (24) and (26), FL tasks can achieve better performance under the same energy budget, which is also numerically validated in the experiments.

Remarks.
We note that the proposed transmit power control only changes the average transmit power across learning rounds. It can be used in conjunction with a faster power control mechanism, such as the channel inversion rule in Eqn. (22) or any other method that handles fast fading or interference, to determine the instantaneous transmit power of the sender. One minor note is that the pathloss component appears in Eqn. (26) but not in Eqn. (24). This is due to the broadcast nature of the downlink; for the uplink, the pathloss is absorbed in the channel inversion expression (22).
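To make the budget-preserving schedule concrete, the following minimal sketch (the function name and parameter values are ours, not from the paper) allocates the per-round transmit power proportionally to $t^2$ while spending exactly the total energy budget, using the closed form $\sum_{t=1}^{T} t^2 = T(T+1)(2T+1)/6$:

```python
import math

def o_t2_power_schedule(total_budget, T):
    """Per-round transmit power P_t proportional to t^2, normalized so the
    per-round powers sum exactly to the total energy budget."""
    norm = T * (T + 1) * (2 * T + 1) / 6.0  # closed form of sum_{t=1}^T t^2
    return [total_budget * t ** 2 / norm for t in range(1, T + 1)]

schedule = o_t2_power_schedule(total_budget=100.0, T=500)
assert math.isclose(sum(schedule), 100.0)                  # budget exactly spent
assert math.isclose(schedule[-1] / schedule[0], 500 ** 2)  # P_t grows as t^2
```

Under such a schedule the received SNR, and hence the inverse of the effective noise power, grows as $O(t^2)$, which is the behavior the theorems require for direct model transmission.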
B. Design Example II: Receiver Diversity Combining for Analog Aggregation
Another technique that can benefit from our theoretical results is to control the diversity order of a receiver combining scheme, e.g., using multiple receive antennas or multiple time or frequency slots. Essentially, we leverage repeated transmissions to reduce the effective noise power via diversity combining; by activating only as many diversity branches as needed as the learning rounds progress, resources can be utilized more efficiently.
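As an illustrative sketch (the helper names and values are ours; the paper only specifies the ceiling rules derived below), the branch-count rule and the variance reduction obtained by averaging repeated receptions can be checked numerically:

```python
import math
import random

def branches_needed(rho_required, rho_fixed):
    """Diversity order emulating a required average SNR with a fixed
    per-transmission power, in the spirit of the ceiling rules (27)-(28)."""
    return math.ceil(rho_required / rho_fixed)

def combined_noise_power(P, sigma2=1.0, trials=20000, seed=0):
    """Empirical power of the average of P IID Gaussian noise samples."""
    rng = random.Random(seed)
    acc = 0.0
    for _ in range(trials):
        avg = sum(rng.gauss(0.0, math.sqrt(sigma2)) for _ in range(P)) / P
        acc += avg * avg
    return acc / trials

assert branches_needed(rho_required=9.5, rho_fixed=2.0) == 5
# Combining P = 4 copies cuts the effective noise power by roughly 4x.
v1 = combined_noise_power(1)
v4 = combined_noise_power(4)
assert 3.5 < v1 / v4 < 4.5
```

The discrete branch count is exactly why diversity combining can only approximate, rather than match, a continuous power control policy.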
Uplink diversity requirement.
We assume the updated local model is independently received $P_t$ times (over time, frequency, space, or some combination of them) in the $t$-th communication round. Reusing the notation and the channel inversion rule in (22), the $P_t$ received signals for the $i$-th element can be written as
$$ y_{t,i,p} = \frac{1}{K} \sum_{k \in \mathcal{S}_t} \sqrt{\rho_{t,p}}\, w_{t,k,i} + n_{t,i,p}, \qquad \forall p = 1, \cdots, P_t. $$
For simplicity, we write $\rho_{t,p} = \rho$ as the average transmit power. The received SNR of the global model after diversity combining can be written as
$$ \mathrm{SNR}_t^{G} = \frac{\mathbb{E}\Big\| \sum_{p=1}^{P_t} \frac{\sqrt{\rho}}{K} \sum_{k \in \mathcal{S}_t} w_{t,k} \Big\|^2}{\mathbb{E}\Big\| \sum_{p=1}^{P_t} n_{t,p} \Big\|^2} = \frac{P_t\, \rho\, \mathbb{E}\big\| \sum_{k \in \mathcal{S}_t} w_t^k \big\|^2}{d K^2}. $$
Compared with the SNR of the power control policy in (23), we can derive the diversity requirement as
$$ P_t = \left\lceil \frac{\rho_t^{UL}}{\rho} \right\rceil, \qquad (27) $$
where $\lceil a \rceil$ denotes the ceiling of $a$.

Downlink diversity requirement.
The server broadcasts the global weights $Q_t$ times (again over time, frequency, space, or some combination of them) in the $t$-th communication round, and each client combines the multiple independent copies of the received signal to achieve a higher SNR (i.e., a lower effective noise power). The downlink signal can be written as
$$ y_{t,n,i,q} = r_{t,n}^{-\alpha/2} h_{t,n,i} \sqrt{\rho_{t,q}}\, w_{t,i} + e_{t,n,i,q}, \qquad \forall n = 1, \cdots, N, \ \forall q = 1, \cdots, Q_t, $$
where $\rho_{t,q} = \rho$ is the transmit power constraint at the server. The downlink SNR for the $n$-th user is
$$ \mathrm{SNR}_{t,n}^{L} = r_{t,n}^{-\alpha}\, Q_t\, \rho. $$
Similarly, compared with the local SNR in (25), we can derive the diversity requirement as
$$ Q_t = \left\lceil \frac{\rho_t^{DL}}{\rho} \right\rceil. \qquad (28) $$
By applying the combining rules in Eqns. (27) and (28), we obtain a complete receiver diversity design that guarantees the convergence of FL at rate $O(1/T)$ under the transmit power constraints at both the clients and the server.

Remarks.
Receiver diversity combining is not as flexible as power control because it can only achieve discrete effective noise power levels, which is also confirmed in the experiments. However, it can be useful in situations where adjusting the average transmit power is not feasible, e.g., when no change at the transmitter is allowed. In addition, one can combine the transmit power control of Section V-A with the receiver diversity combining of Section V-B in a straightforward manner. We also note that there are other methods, such as increasing the precision of analog-to-digital converters (ADCs), to implement the SNR control policy; the general design principles in Theorems 1 to 3 can be similarly realized.

VI. EXPERIMENT RESULTS
A. Experiment Setup
We consider a communication system with noisy uplink and downlink channels supporting various FL tasks. For simplicity, we assume that every channel use has the same noise level. In each communication round of FL, the updated (local or global) ML model (or model differential when applicable) is transmitted over the noisy channel as described in Section III-B. Suppose there are $T$ total communication rounds and both the uplink and downlink total energy budgets are $P = \sum_{t=1}^{T} P_t$, where $P_t$ is the transmit power of the $t$-th round. We consider the following four schemes in the experiments.

1) Noise free. This is the ideal case where there is no noise in either the uplink or downlink channels, i.e., the exact model parameters are perfectly received at the server and clients. This represents the best-case performance.

2) Equal power allocation. In each communication round, the uplink and downlink average transmit power is the same, i.e., $P_t = P/T, \forall t = 1, \cdots, T$. The received SNR of the model parameters is set to 10 dB in the experiments. This represents the current state of the art in [13].

3) $O(t^2)$-increased power control policy. The transmit power increases at the rate $O(t^2)$ with the communication rounds, i.e., the received SNR of the model parameters increases and the effective noise of the signal decreases as FL progresses. With the total budget $P$, it is straightforward to compute that $P_t = 6 P t^2 / \big(T(T+1)(2T+1)\big), \forall t = 1, \cdots, T$.

4) $O(t^2)$-increased diversity combining policy. The transmit power in both downlink and uplink remains the same, but the final models at the server and clients in each communication round are obtained by combining multiple repeated transmissions. The number of repeated transmissions increases at the rate $O(t^2)$; for simple discretization, we use a small set of increasing diversity orders over successive segments of the communication rounds in both uplink and downlink model transmissions. Note that the total transmit power budget of this scheme remains the same as the previous two methods.

We use standard image classification and natural language processing FL tasks to evaluate the performance of these schemes. The following three standard datasets, commonly accepted as benchmarks for evaluating FL performance, are used in the experiments.

1) MNIST.
The training set contains 60,000 examples. For the full clients participation case, the training set is evenly distributed over $N = K = 10$ clients. For the partial clients participation case, the training set is evenly partitioned over $N = 2000$ clients, each holding 30 examples, and we set $K = 20$ per round (1% of the total users). For the IID case, the data is shuffled and randomly assigned to the clients, while for the non-IID case the data is sorted by label and each client is randomly assigned 1 or 2 labels. The CNN model has two convolution layers (32 and 64 channels, respectively, each followed by max pooling), a fully connected layer with 512 units and ReLU activation, and a final softmax output layer. The following parameters are used for training: local batch size $BS = 5$, number of local epochs $E = 1$, and learning rate $\eta = 0.$

2) CIFAR-10.
We set $N = K = 10$ for the full clients participation case and $N = 100$, $K = 10$ for the partial clients participation case. We train a CNN model with two convolution layers (both with 64 channels), two fully connected layers (384 and 192 units, respectively) with ReLU activation, and a final softmax output layer. Both convolution layers are followed by max pooling and a local response normalization layer. The training parameters are: (a) IID: $BS = 50$, $E = 5$, learning rate initially set to $\eta = 0.$ and decayed every 10 rounds with rate 0.99; (b) non-IID: $BS = 100$, $E = 1$, $\eta = 0.$ with decay every round at rate 0.992.

3) Shakespeare.
This dataset is built from The Complete Works of William Shakespeare, and each speaking role is viewed as a client. The dataset is therefore naturally unbalanced and non-IID, since the number of lines and the speaking habits of each role vary significantly. The dataset [44] contains a large number of roles in total; we randomly pick a subset of them and build a dataset with the corresponding training and test examples. We also construct an IID dataset by shuffling the data, redistributing it evenly across the roles, and setting $K = 10$. The ML task is next-character prediction, and we use a classifier with an 8D embedding layer, two LSTM layers, and a softmax output layer. The training parameters are: $BS = 20$, $E = 1$, learning rate initially set to $\eta = 0.$ with periodic decay.

Fig. 2. Comparing the performance of transmit power control to the baselines with full clients participation, model transmission, and both IID (left two) and non-IID (right two) FL on the CIFAR-10 dataset.

We compare the test accuracies and training losses as functions of the communication rounds for all the aforementioned configurations. All reported results are obtained by averaging over 5 independent runs. We also report the final test accuracy, averaged over the last 10 rounds, as the performance of the final global model.

B. Experiment Results for Transmit Power Control
We primarily evaluate the $O(t^2)$-increased transmit power control and compare it with the noise-free and equal power allocation baselines. The focus of the experiments is on partial clients participation under both MT and MDT, but we first report the results for full clients participation on CIFAR-10.

Full clients participation.
We see from Fig. 2 that under the same total power budget, the $O(t^2)$ power control policy performs better than the equal power allocation scheme and comes very close to the noise-free ideal case, achieving a higher final test accuracy in both the IID and non-IID data partitions on CIFAR-10. Note that the training loss (test accuracy) of the equal power allocation scheme increases (decreases) during the late rounds in the non-IID case, implying that a non-increasing SNR may cause the convergence of FL to deteriorate on more difficult ML tasks. Similar results are observed for MNIST and Shakespeare in the full clients participation case.

Partial clients participation.
The performance comparisons of the three schemes on the MNIST, CIFAR-10, and Shakespeare datasets, in both IID and non-IID configurations with MT, are reported in

Fig. 3. Comparing the performance of transmit power control to the baselines with partial clients participation, model transmission, and both IID (left two) and non-IID (right two) FL on the MNIST dataset.
Fig. 4. Comparing the performance of transmit power control to the baselines with partial clients participation, model transmission, and both IID (left two) and non-IID (right two) FL on the CIFAR-10 dataset.
Figs. 3, 4, and 5, respectively. Their final model accuracies (after $T$ rounds of FL are complete) are also summarized in Table I. First, we see from Fig. 3 that the proposed $O(t^2)$-increased power allocation scheme achieves higher test accuracy and lower training loss than the equal power allocation scheme under the same energy budget on MNIST. In particular, the $O(t^2)$-increased power allocation scheme achieves 0.6% higher test accuracy than the equal power allocation scheme in both the IID and non-IID data partitions. This gain may seem insignificant, but that is mostly because MNIST classification is a very simple task. In fact, the gain of power control is much more notable for the challenging CIFAR-10 and Shakespeare tasks, as shown in Figs. 4 and 5, respectively. Compared with the equal power allocation scheme, which achieves 90.2% and 81.6% of the ideal (noise-free) test accuracy in the IID and non-IID data partitions on CIFAR-10, the proposed $O(t^2)$-increased power allocation achieves 99.2% (IID) and 95.9% (non-IID) of the ideal test accuracy after $T = 500$ communication rounds. Similarly, on Shakespeare, the equal power allocation scheme achieves 91.5% (IID) and 95.8% (non-IID) of the ideal test accuracy, while the proposed method improves these to 100% and 99.3%, respectively. Both tasks show significant accuracy improvement due to the $O(t^2)$ power control.

Fig. 5. Comparing the performance of transmit power control to the baselines with partial clients participation, model transmission, and both IID (left two) and non-IID (right two) FL on the Shakespeare dataset.

Fig. 6. Comparing the performance of transmit power control to the baselines with partial clients participation, model differential transmission, and both IID (left two) and non-IID (right two) FL on the MNIST dataset.

Fig. 7. Comparing the performance of transmit power control to the baselines with partial clients participation, model differential transmission, and both IID (left two) and non-IID (right two) FL on the CIFAR-10 dataset.

MDT.
We next present the experiment results for model differential transmission. Note that with MDT, the uplink transmit power of the proposed scheme remains constant (recall that the SNR is set to 10 dB) while the downlink transmit power still increases at the rate $O(t^2)$. Figs. 6, 7, and 8 illustrate the test accuracies and training losses with MDT on the MNIST, CIFAR-10, and Shakespeare datasets, and the final model accuracies of the three schemes are summarized in Table II. We see that the proposed power control policy achieves 99.7% (99.7%), 99.2% (98.0%), and 100% (98.9%) of the ideal test accuracy in the IID (non-IID) data setting on MNIST, CIFAR-10, and Shakespeare, respectively, which significantly outperforms the baseline equal power allocation scheme.

Fig. 8. Comparing the performance of transmit power control to the baselines with partial clients participation, model differential transmission, and both IID (left two) and non-IID (right two) FL on the Shakespeare dataset.

TABLE I: PERFORMANCE SUMMARY OF MODEL TRANSMISSION (MT)
Dataset      Scheme           IID Accuracy  IID Percentage*  non-IID Accuracy  non-IID Percentage*
MNIST        Noise free       99.3%         100%             99.1%             100%
MNIST        Increased power  99.1%         99.8%            99.0%             99.9%
MNIST        Equal power      98.5%         99.2%            98.4%             99.3%
CIFAR-10     Noise free       79.5%         100%             54.3%             100%
CIFAR-10     Increased power  78.9%         99.2%            52.1%             95.9%
CIFAR-10     Equal power      71.7%         90.2%            44.3%             81.6%
Shakespeare  Noise free       57.8%         100%             56.8%             100%
Shakespeare  Increased power  57.8%         100%             56.4%             99.3%
Shakespeare  Equal power      52.9%         91.5%            54.4%             95.8%
* Percentage of the corresponding noise-free test accuracy.
TABLE II: PERFORMANCE SUMMARY OF MODEL DIFFERENTIAL TRANSMISSION (MDT)
Dataset      Scheme           IID Accuracy  IID Percentage*  non-IID Accuracy  non-IID Percentage*
MNIST        Noise free       99.3%         100%             99.1%             100%
MNIST        Increased power  99.0%         99.7%            98.8%             99.7%
MNIST        Equal power      96.7%         97.4%            97.5%             98.4%
CIFAR-10     Noise free       79.5%         100%             54.3%             100%
CIFAR-10     Increased power  78.9%         99.2%            53.2%             98.0%
CIFAR-10     Equal power      73.9%         93.0%            47.7%             87.8%
Shakespeare  Noise free       57.8%         100%             56.8%             100%
Shakespeare  Increased power  57.8%         100%             56.2%             98.9%
Shakespeare  Equal power      53.3%         92.2%            54.3%             95.6%
* Percentage of the corresponding noise-free test accuracy.
Fig. 9. Comparing the performance of diversity combining to the baselines with partial clients participation, model differential transmission, and both IID (left two) and non-IID (right two) FL on the CIFAR-10 dataset.
C. Experiment Results for Receiver Diversity Combining
We next evaluate the performance of diversity combining. Fig. 9 shows the test accuracies and training losses of diversity combining together with the noise-free and equal power allocation schemes. Although diversity combining is less flexible than the (continuous) transmit power control policy, it still outperforms the baseline method and approaches the noise-free ideal case. We notice that the training losses of diversity combining are larger than those of the equal power allocation scheme in the early stage of convergence, but as the number of diversity branches increases, the training losses eventually decrease and the model converges to a better global one. In particular, diversity combining achieves higher test accuracies than the equal power allocation scheme in both the IID and non-IID data partitions.

VII. CONCLUSION
In this paper, we have investigated federated learning over noisy channels, where a FedAvg pipeline with both uplink and downlink communication noise was studied. By theoretically analyzing the convergence of FL under noisy communications in both directions, we have proved that the same $O(1/T)$ convergence rate of FedAvg under perfect communications can be maintained if the uplink and downlink SNRs are controlled as $O(t^2)$ over noisy channels for direct model transmission, and kept constant for model differential transmission. We have showcased two commonly used communication techniques – transmit power control and diversity combining – to implement these theoretical results. Extensive experimental results have corroborated the theoretical analysis and demonstrated the performance superiority of the proposed transmit power control and diversity combining schemes over baseline methods under the same total energy budget.

APPENDIX A: PROOF OF THEOREM 1

A. Preliminaries
With a slight abuse of notation, we change the timeline to be with respect to the overall SGD iteration time steps instead of the communication rounds, i.e.,
$$ \underbrace{t = 1, \cdots, E}_{\text{round } 1},\ \underbrace{E+1, \cdots, 2E}_{\text{round } 2},\ \cdots,\ \underbrace{(T-1)E+1, \cdots, TE}_{\text{round } T}. $$
Note that the (noisy) global model $w_t$ is only accessible at the clients for specific $t \in \mathcal{I}_E$, where $\mathcal{I}_E = \{ nE \mid n = 1, 2, \ldots \}$, i.e., the time steps for communication. The notations for $\eta_t$, $\sigma_t$ and $\zeta_t$ are similarly adjusted to this extended timeline, but their values remain the same inside the same round.

Client $k$ trains its model with mini-batch SGD, and we denote the one-step model update as
$$ v_{t+1}^k = w_t^k - \eta_t \nabla F_k(w_t^k, \xi_t^k). \qquad (29) $$
If $t+1 \notin \mathcal{I}_E$, the next-step result is $w_{t+1}^k = v_{t+1}^k$ since no global aggregation takes place. If $t+1 \in \mathcal{I}_E$, the server receives the noisy weights $\tilde{w}_{t+1}^k$ from every client $k \in [N]$. The global model is updated as $w_{t+1} = \frac{1}{N} \sum_{k \in [N]} \tilde{w}_{t+1}^k$ and then broadcast to all clients, who only receive the noisy global models $\hat{w}_{t+1}^k = w_{t+1} + e_{t+1}^k$ due to the downlink communication errors. They then start the next local training period. We define the following variables that are used in the subsequent analysis:
$$ u_{t+1}^k \triangleq \begin{cases} v_{t+1}^k & \text{if } t+1 \notin \mathcal{I}_E, \\ \frac{1}{N} \sum_{i \in [N]} v_{t+1}^i & \text{if } t+1 \in \mathcal{I}_E; \end{cases} \qquad p_{t+1}^k \triangleq \begin{cases} v_{t+1}^k & \text{if } t+1 \notin \mathcal{I}_E, \\ u_{t+1}^k + \frac{1}{N} \sum_{i \in [N]} n_{t+1}^i & \text{if } t+1 \in \mathcal{I}_E; \end{cases} $$
$$ w_{t+1}^k \triangleq \begin{cases} v_{t+1}^k & \text{if } t+1 \notin \mathcal{I}_E, \\ p_{t+1}^k + e_{t+1}^k & \text{if } t+1 \in \mathcal{I}_E. \end{cases} $$
To facilitate the analysis, we define the virtual sequences
$$ \overline{v}_t = \frac{1}{N} \sum_{k=1}^{N} v_t^k, \quad \overline{u}_t = \frac{1}{N} \sum_{k=1}^{N} u_t^k, \quad \overline{p}_t = \frac{1}{N} \sum_{k=1}^{N} p_t^k, \quad \overline{w}_t = \frac{1}{N} \sum_{k=1}^{N} w_t^k. \qquad (30) $$
We also define $\overline{g}_t = \frac{1}{N} \sum_{k=1}^{N} \nabla F_k(w_t^k)$ and $g_t = \frac{1}{N} \sum_{k=1}^{N} \nabla F_k(w_t^k, \xi_t^k)$ for convenience.
Therefore, $\overline{v}_{t+1} = \overline{w}_t - \eta_t g_t$ and $\mathbb{E}[g_t] = \overline{g}_t$. We can also write the specific forms of these virtual sequences when $t+1 \in \mathcal{I}_E$ as follows:
$$ \overline{u}_{t+1} = \frac{1}{N} \sum_{k=1}^{N} u_{t+1}^k = \frac{1}{N} \sum_{i \in [N]} v_{t+1}^i, \qquad (31) $$
$$ \overline{p}_{t+1} = \frac{1}{N} \sum_{k=1}^{N} p_{t+1}^k = \frac{1}{N} \sum_{k=1}^{N} u_{t+1}^k + \frac{1}{N} \sum_{i \in [N]} n_{t+1}^i = \overline{u}_{t+1} + \frac{1}{N} \sum_{i \in [N]} n_{t+1}^i, \qquad (32) $$
$$ \overline{w}_{t+1} = \frac{1}{N} \sum_{k=1}^{N} w_{t+1}^k = \frac{1}{N} \sum_{k=1}^{N} \big[ p_{t+1}^k + e_{t+1}^k \big] = \overline{p}_{t+1} + \frac{1}{N} \sum_{k=1}^{N} e_{t+1}^k. \qquad (33) $$
Note that for $t+1 \notin \mathcal{I}_E$, all these virtual sequences coincide. In addition, the global model (at the server) $\overline{p}_{t+1}$ is meaningful only at $t+1 \in \mathcal{I}_E$. We emphasize that when $t+1 \in \mathcal{I}_E$, Eqns. (32) and (10) indicate that $\overline{p}_{t+1} = w_{t+1}$. Thus it is sufficient to analyze the convergence of $\| \overline{p}_{t+1} - w^* \|^2$ to prove Theorem 1.

B. Lemmas and proofs
Lemma 1.
Let Assumptions 1-1) to 1-4) hold, let $\eta_t$ be non-increasing with $\eta_t \leq 2\eta_{t+E}$ for all $t \geq 0$. If $\eta_t \leq \frac{1}{4L}$, we have
$$ \mathbb{E}\|\overline{v}_{t+1} - w^*\|^2 \leq (1 - \eta_t \mu)\, \mathbb{E}\|\overline{w}_t - w^*\|^2 + \eta_t^2 \Big( \frac{\sum_{k=1}^{N} \delta_k^2}{N^2} + 6 L \Gamma + 8 (E-1)^2 H^2 \Big). $$
Lemma 1 establishes a bound for the one-step SGD. This result only concerns the local model update and is not impacted by the noisy communication. The proof can be constructed following the same steps as in [7], [23].
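Lemma 1 yields one-step contractions of the form $\Delta_{t+1} \leq (1-\eta_t\mu)\Delta_t + \eta_t^2 D$, which with a decaying $\eta_t = \beta/(t+\gamma)$ drives $\Delta_t$ down at rate $O(1/t)$. The following sketch (parameter values are arbitrary, chosen only for illustration) iterates the worst-case recursion with equality and checks the induction invariant $\Delta_t \leq v/(\gamma+t)$ used later in the proof:

```python
def recursion_bound_holds(mu=0.5, D=3.0, gamma=10.0, delta0=5.0, T=2000):
    """Iterate Delta_{t+1} = (1 - eta_t*mu)*Delta_t + eta_t^2*D with
    eta_t = beta/(t+gamma), beta = 2/mu, and verify Delta_t <= v/(gamma+t)."""
    beta = 2.0 / mu
    v = max(beta ** 2 * D / (beta * mu - 1.0), (gamma + 1.0) * delta0)
    delta = delta0
    for t in range(T):
        if delta > v / (gamma + t):  # induction invariant must never fail
            return False
        eta = beta / (t + gamma)
        delta = (1.0 - eta * mu) * delta + eta ** 2 * D
    return True

assert recursion_bound_holds()
```

The invariant holds for every iterate, which is exactly the mechanism by which the proofs below convert the one-step bound into an $O(1/T)$ convergence rate.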
Lemma 2.
We have
$$ \mathbb{E}[\overline{p}_{t+1}] = \overline{u}_{t+1}, \qquad \mathbb{E}\|\overline{u}_{t+1} - \overline{p}_{t+1}\|^2 = \frac{d\, \overline{\sigma}_{t+1}^2}{N^2}, \qquad (34) $$
and
$$ \mathbb{E}[\overline{w}_{t+1}] = \overline{p}_{t+1}, \qquad \mathbb{E}\|\overline{w}_{t+1} - \overline{p}_{t+1}\|^2 = \frac{d\, \overline{\zeta}_{t+1}^2}{N^2}, \qquad (35) $$
for $t+1 \in \mathcal{I}_E$, where $\overline{\sigma}_{t+1}^2 \triangleq \sum_{k \in [N]} \sigma_{t+1,k}^2$ and $\overline{\zeta}_{t+1}^2 \triangleq \sum_{k \in [N]} \zeta_{t+1,k}^2$.

Proof.
For $t+1 \in \mathcal{I}_E$, according to (32), we have $\mathbb{E}[\overline{p}_{t+1} - \overline{u}_{t+1}] = \frac{1}{N} \sum_{k \in [N]} \mathbb{E}[n_{t+1}^k] = 0$, and
$$ \mathbb{E}\|\overline{p}_{t+1} - \overline{u}_{t+1}\|^2 = \frac{1}{N^2}\, \mathbb{E}\Big\| \sum_{k \in [N]} n_{t+1}^k \Big\|^2 = \frac{1}{N^2} \sum_{k \in [N]} \mathbb{E}\big\| n_{t+1}^k \big\|^2 = \frac{\sum_{k \in [N]} d\, \sigma_{t+1,k}^2}{N^2} = \frac{d\, \overline{\sigma}_{t+1}^2}{N^2}, $$
because $\{ n_{t+1}^k, \forall k \}$ are independent. Similarly, according to (33), we have $\mathbb{E}[\overline{w}_{t+1} - \overline{p}_{t+1}] = \frac{1}{N} \sum_{k \in [N]} \mathbb{E}[e_{t+1}^k] = 0$ and
$$ \mathbb{E}\|\overline{w}_{t+1} - \overline{p}_{t+1}\|^2 = \frac{1}{N^2}\, \mathbb{E}\Big\| \sum_{k \in [N]} e_{t+1}^k \Big\|^2 = \frac{1}{N^2} \sum_{k \in [N]} \mathbb{E}\big\| e_{t+1}^k \big\|^2 = \frac{\sum_{k \in [N]} d\, \zeta_{t+1,k}^2}{N^2} = \frac{d\, \overline{\zeta}_{t+1}^2}{N^2}. $$

C. Proof of Theorem 1
We need to consider four cases in the analysis of the convergence of $\mathbb{E}\|\overline{p}_{t+1} - w^*\|^2$.

1) If $t \notin \mathcal{I}_E$ and $t+1 \notin \mathcal{I}_E$, then $\overline{w}_t = \overline{p}_t$ and $\overline{v}_{t+1} = \overline{p}_{t+1}$. Using Lemma 1, we have
$$ \mathbb{E}\|\overline{p}_{t+1} - w^*\|^2 = \mathbb{E}\|\overline{v}_{t+1} - w^*\|^2 \leq (1-\eta_t\mu)\, \mathbb{E}\|\overline{p}_t - w^*\|^2 + \eta_t^2 \Big[ \frac{\sum_{k=1}^{N} \delta_k^2}{N^2} + 6L\Gamma + 8(E-1)^2 H^2 \Big]. \qquad (36) $$

2) If $t \in \mathcal{I}_E$ and $t+1 \notin \mathcal{I}_E$, we still have $\overline{v}_{t+1} = \overline{p}_{t+1}$, but now $\overline{w}_t = \overline{p}_t + \frac{1}{N}\sum_{k=1}^{N} e_t^k$. Then
$$ \|\overline{w}_t - w^*\|^2 = \|\overline{w}_t - \overline{p}_t + \overline{p}_t - w^*\|^2 = \|\overline{p}_t - w^*\|^2 + \underbrace{\|\overline{w}_t - \overline{p}_t\|^2}_{A_1} + \underbrace{2\langle \overline{w}_t - \overline{p}_t,\ \overline{p}_t - w^* \rangle}_{A_2}. $$
The expectation of $A_2$ over the noise randomness is zero since $\mathbb{E}[\overline{w}_t - \overline{p}_t] = 0$ (from Eqn. (35)), and the expectation of $A_1$ is bounded by Lemma 2. We then have
$$ \mathbb{E}\|\overline{p}_{t+1} - w^*\|^2 = \mathbb{E}\|\overline{v}_{t+1} - w^*\|^2 \leq (1-\eta_t\mu)\, \mathbb{E}\|\overline{p}_t - w^*\|^2 + (1-\eta_t\mu)\frac{d\,\overline{\zeta}_t^2}{N^2} + \eta_t^2 \Big[ \frac{\sum_{k=1}^{N} \delta_k^2}{N^2} + 6L\Gamma + 8(E-1)^2 H^2 \Big]. \qquad (37) $$

3) If $t \notin \mathcal{I}_E$ and $t+1 \in \mathcal{I}_E$, then we still have $\overline{w}_t = \overline{p}_t$, and for $t+1$ we need to evaluate
$$ \|\overline{p}_{t+1} - w^*\|^2 = \|\overline{p}_{t+1} - \overline{u}_{t+1} + \overline{u}_{t+1} - w^*\|^2 = \underbrace{\|\overline{p}_{t+1} - \overline{u}_{t+1}\|^2}_{B_1} + \underbrace{\|\overline{u}_{t+1} - w^*\|^2}_{B_2} + \underbrace{2\langle \overline{p}_{t+1} - \overline{u}_{t+1},\ \overline{u}_{t+1} - w^* \rangle}_{B_3}. \qquad (38) $$
The expectation of $B_3$ over the noise is zero since $\mathbb{E}[\overline{u}_{t+1} - \overline{p}_{t+1}] = 0$ (from Eqn. (34)), and the expectation of $B_1$ is bounded by Lemma 2. Noticing that $\overline{u}_{t+1} = \overline{v}_{t+1}$ in $B_2$ and applying Lemma 1, we have
$$ \mathbb{E}\|\overline{p}_{t+1} - w^*\|^2 \leq \mathbb{E}\|\overline{v}_{t+1} - w^*\|^2 + \frac{d\,\overline{\sigma}_{t+1}^2}{N^2} \leq (1-\eta_t\mu)\, \mathbb{E}\|\overline{p}_t - w^*\|^2 + \frac{d\,\overline{\sigma}_{t+1}^2}{N^2} + \eta_t^2 \Big[ \frac{\sum_{k=1}^{N} \delta_k^2}{N^2} + 6L\Gamma + 8(E-1)^2 H^2 \Big]. \qquad (39) $$

4) If $t \in \mathcal{I}_E$ and $t+1 \in \mathcal{I}_E$ (possible only when $E = 1$), then $\overline{v}_{t+1} \neq \overline{p}_{t+1}$ and $\overline{w}_t \neq \overline{p}_t$. Combining the results of the previous two cases, we have
$$ \mathbb{E}\|\overline{p}_{t+1} - w^*\|^2 \leq (1-\eta_t\mu)\, \mathbb{E}\|\overline{p}_t - w^*\|^2 + (1-\eta_t\mu)\frac{d\,\overline{\zeta}_t^2}{N^2} + \frac{d\,\overline{\sigma}_{t+1}^2}{N^2} + \eta_t^2 \Big[ \frac{\sum_{k=1}^{N} \delta_k^2}{N^2} + 6L\Gamma + 8(E-1)^2 H^2 \Big]. \qquad (40) $$

Finally, inequality (40) holds in all four cases. Denote $\Delta_t = \mathbb{E}\|\overline{p}_t - w^*\|^2$. If we set the effective noise powers $\overline{\sigma}_{t+1}^2$ and $\overline{\zeta}_t^2$ such that $\overline{\sigma}_{t+1}^2 \leq N^2\eta_t^2$ and $\overline{\zeta}_t^2 \leq \frac{N^2\eta_t^2}{1-\eta_t\mu}$, we always have
$$ \Delta_{t+1} \leq (1-\eta_t\mu)\Delta_t + \eta_t^2 D, \qquad \text{where } D = \frac{\sum_{k=1}^{N} \delta_k^2}{N^2} + 6L\Gamma + 8(E-1)^2 H^2 + 2d. $$
We decay the learning rate as $\eta_t = \frac{\beta}{t+\gamma}$ for some $\beta > \frac{1}{\mu}$ and $\gamma > 0$ such that $\eta_0 \leq \min\{\frac{1}{\mu}, \frac{1}{4L}\} = \frac{1}{4L}$ and $\eta_t \leq 2\eta_{t+E}$. We now prove $\Delta_t \leq \frac{v}{\gamma+t}$, where $v = \max\{\frac{\beta^2 D}{\beta\mu-1}, (\gamma+1)\Delta_0\}$, by induction. First, the definition of $v$ ensures that the claim holds for $t = 0$. Assume it holds for some $t > 0$. It then follows that
$$ \Delta_{t+1} \leq (1-\eta_t\mu)\Delta_t + \eta_t^2 D \leq \Big(1 - \frac{\beta\mu}{t+\gamma}\Big)\frac{v}{t+\gamma} + \frac{\beta^2 D}{(t+\gamma)^2} = \frac{t+\gamma-1}{(t+\gamma)^2}\,v + \Big[ \frac{\beta^2 D}{(t+\gamma)^2} - \frac{\beta\mu-1}{(t+\gamma)^2}\,v \Big] \leq \frac{v}{t+\gamma+1}. $$
Specifically, if we choose $\beta = \frac{2}{\mu}$, $\gamma = \max\{\frac{8L}{\mu}, E\}$ and denote $\kappa = \frac{L}{\mu}$, then $\eta_t = \frac{2}{\mu(\gamma+t)}$. Using $\max\{a, b\} \leq a + b$, we have $v \leq \frac{4D}{\mu^2} + (\gamma+1)\Delta_0 \leq \frac{4D}{\mu^2} + (8\kappa + E)\|w_0 - w^*\|^2$. Therefore,
$$ \Delta_t \leq \frac{v}{\gamma+t} \leq \frac{1}{\gamma+t}\Big[ \frac{4D}{\mu^2} + (8\kappa + E)\|w_0 - w^*\|^2 \Big]. $$
Setting $t = T$ concludes the proof.

APPENDIX B: PROOF OF THEOREM 2

We consider a "virtual" FL process in which all clients receive the noisy downlink broadcast of the latest global model and all participate in the subsequent local model update phase, but only the selected clients in $\mathcal{S}_{t+1}$ upload their updated local models to the server via the noisy uplink channel. It is clear that this virtual FL process is equivalent to the original one in terms of convergence – clients that are not selected do not contribute to the global model aggregation. This seemingly redundant construction, however, circumvents the difficulty caused by partial clients participation, as can be seen in the analysis.

Before presenting the proof, we first elaborate on some necessary changes of notation. The notation defined in Section A-A can be largely reused, with the notable distinction that we now have to separate the cases for $K$ and for $N$. The variables $u_{t+1}^k$ and $p_{t+1}^k$ are now defined as
$$ u_{t+1}^k \triangleq \begin{cases} v_{t+1}^k & \text{if } t+1 \notin \mathcal{I}_E, \\ \frac{1}{K} \sum_{i \in \mathcal{S}_t} v_{t+1}^i & \text{if } t+1 \in \mathcal{I}_E; \end{cases} \qquad p_{t+1}^k \triangleq \begin{cases} v_{t+1}^k & \text{if } t+1 \notin \mathcal{I}_E, \\ u_{t+1}^k + \frac{1}{K} \sum_{i \in \mathcal{S}_t} n_{t+1}^i & \text{if } t+1 \in \mathcal{I}_E. \end{cases} $$
Note that Lemma 2 still holds with the following update:
$$ \mathbb{E}\|\overline{u}_{t+1} - \overline{p}_{t+1}\|^2 = \frac{d\,\overline{\sigma}_{t+1}^2}{K^2}, \qquad \mathbb{E}\|\overline{w}_{t+1} - \overline{p}_{t+1}\|^2 = \frac{d\,\overline{\zeta}_{t+1}^2}{N^2}. $$
In addition, we need the following lemma, whose proof is omitted due to the space limitation.

Lemma 3.
Let Assumption 1-4) hold, and assume $\eta_t \leq 2\eta_{t+E}$ for all $t \geq 0$. Then for $t+1 \in \mathcal{I}_E$, we have $\mathbb{E}[\overline{u}_{t+1}] = \overline{v}_{t+1}$ and
$$ \mathbb{E}\|\overline{v}_{t+1} - \overline{u}_{t+1}\|^2 \leq \frac{N-K}{N-1}\,\frac{4}{K}\,\eta_t^2 E^2 H^2. \qquad (41) $$
We can now analyze the same four cases as in Section A-C. Cases 1) and 2) remain the same as before. For Case 3), we need to consider $t \notin \mathcal{I}_E$ and $t+1 \in \mathcal{I}_E$. Note that Eqn. (38) still holds, but we need to re-evaluate the expectation of $B_2$ because of partial clients participation. We have
$$ \|\overline{u}_{t+1} - w^*\|^2 = \|\overline{u}_{t+1} - \overline{v}_{t+1} + \overline{v}_{t+1} - w^*\|^2 = \underbrace{\|\overline{u}_{t+1} - \overline{v}_{t+1}\|^2}_{C_1} + \underbrace{\|\overline{v}_{t+1} - w^*\|^2}_{C_2} + \underbrace{2\langle \overline{u}_{t+1} - \overline{v}_{t+1},\ \overline{v}_{t+1} - w^* \rangle}_{C_3}. \qquad (42) $$
When the expectation is taken over the random client sampling, the expectation of $C_3$ is zero since $\mathbb{E}[\overline{u}_{t+1} - \overline{v}_{t+1}] = 0$ (from Eqn. (41)), and the expectation of $C_1$ is bounded by Lemma 3. Therefore, using Lemma 1, we have
$$ \mathbb{E}\|\overline{p}_{t+1} - w^*\|^2 \leq \mathbb{E}\|\overline{v}_{t+1} - w^*\|^2 + \frac{d\,\overline{\sigma}_{t+1}^2}{K^2} + \frac{N-K}{N-1}\frac{4}{K}\eta_t^2 E^2 H^2 \leq (1-\eta_t\mu)\,\mathbb{E}\|\overline{p}_t - w^*\|^2 + \frac{d\,\overline{\sigma}_{t+1}^2}{K^2} + \eta_t^2 \Big[ \frac{\sum_{k=1}^{N} \delta_k^2}{N^2} + 6L\Gamma + 8(E-1)^2 H^2 + \frac{N-K}{N-1}\frac{4}{K} E^2 H^2 \Big]. $$
Case 4) is similarly updated based on the new result in Case 3). Finally, we have that
$$ \mathbb{E}\|\overline{p}_{t+1} - w^*\|^2 \leq (1-\eta_t\mu)\,\mathbb{E}\|\overline{p}_t - w^*\|^2 + (1-\eta_t\mu)\frac{d\,\overline{\zeta}_t^2}{N^2} + \frac{d\,\overline{\sigma}_{t+1}^2}{K^2} + \eta_t^2 \Big[ \frac{\sum_{k=1}^{N} \delta_k^2}{N^2} + 6L\Gamma + 8(E-1)^2 H^2 + \frac{N-K}{N-1}\frac{4}{K} E^2 H^2 \Big] \qquad (43) $$
holds in all cases. If we set $\overline{\sigma}_{t+1}^2 \leq K^2\eta_t^2$ and $\overline{\zeta}_t^2 \leq \frac{N^2\eta_t^2}{1-\eta_t\mu}$, the remaining proof follows the same way as in Section A-C.
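The $\frac{N-K}{N-1}$ factor in Lemma 3 is the classic finite-population correction for sampling $K$ of $N$ clients without replacement. As a small, exact numerical check (the values and the helper function are ours, for illustration only):

```python
import math
from itertools import combinations
from statistics import mean

def sample_mean_variance(values, K):
    """Exact variance of the mean of K items drawn uniformly at random
    without replacement, computed by enumerating all size-K subsets."""
    means = [mean(c) for c in combinations(values, K)]
    mu = mean(means)
    return mean((m - mu) ** 2 for m in means)

values = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
N, K = len(values), 3
pop_var = mean((x - mean(values)) ** 2 for x in values)

# Finite-population correction: Var(mean) = (N-K) / ((N-1) * K) * pop_var.
predicted = (N - K) / ((N - 1) * K) * pop_var
assert math.isclose(sample_mean_variance(values, K), predicted)
```

The correction vanishes as $K \to N$, matching the intuition that full participation removes the sampling variance entirely.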
APPENDIX C: PROOF OF THEOREM 3

When $t+1 \in \mathcal{I}_E$, the global aggregation is given in Eqn. (11). Similar to Appendices A and B, we expand the timeline to be with respect to the overall SGD iteration time steps, and define the following variables to facilitate the proof:
$$v^k_{t+1} \triangleq w^k_t - \eta_t \nabla F_k(w^k_t, \xi^k_t), \qquad d^k_{t+1} \triangleq v^k_{t+1} - w^k_{t+1-E}.$$
Furthermore, when $t+1 \notin \mathcal{I}_E$ we define $u^k_{t+1} = p^k_{t+1} = w^k_{t+1} \triangleq v^k_{t+1}$, and when $t+1 \in \mathcal{I}_E$ we define
$$u^k_{t+1} \triangleq \frac{1}{K}\sum_{i \in \mathcal{S}_{t+1}} v^i_{t+1}, \qquad p^k_{t+1} \triangleq w_{t+1-E} + \frac{1}{K}\sum_{i \in \mathcal{S}_{t+1}} \left[ d^i_{t+1} + n^i_{t+1} \right], \qquad w^k_{t+1} \triangleq p^k_{t+1} + e^k_{t+1}.$$
The virtual sequences $\bar{v}_t$, $\bar{u}_t$, $\bar{p}_t$ and $\bar{w}_t$ remain the same as in Eqn. (30); $g_t$ and $\bar{g}_t$ are also similarly defined. Note that the global model at the server is the same as $\bar{p}_t$, i.e., $w_{t+1} = \bar{p}_{t+1}$. We first establish the following lemma, which is instrumental in the proof of Theorem 3.

Lemma 4.
Let Assumptions 1)-4) hold, assume that $\eta_t \le 2\eta_{t+E}$ for all $t \ge 0$, and further assume that the uplink communication adopts a constant SNR control policy: $\mathrm{SNR}^{\mathrm{S,MDT}}_{t,k} = \nu$. Then, for $t+1 \in \mathcal{I}_E$, we have
$$\mathbb{E}\left[\bar{p}_{t+1}\right] = \mathbb{E}\left[\bar{u}_{t+1}\right], \qquad \mathbb{E}\left\|\bar{u}_{t+1} - \bar{p}_{t+1}\right\|^2 \le \left(1 + \frac{1}{\nu}\right)\frac{d\,\bar{\zeta}^2_{t+1-E}}{K} + \frac{4E^2}{K\nu}\eta_t^2 H^2.$$

Proof. Note that if $t+1 \in \mathcal{I}_E$, so does $t+1-E$. Inserting $d^k_{t+1} = v^k_{t+1} - w^k_{t+1-E}$ into $p^k_{t+1}$, we have
$$\mathbb{E}\left[\bar{p}_{t+1}\right] = \mathbb{E}\left[\bar{u}_{t+1}\right] + \frac{1}{K}\mathbb{E}\left[\sum_{k \in \mathcal{S}_{t+1}} n^k_{t+1}\right] - \frac{1}{K}\mathbb{E}\left[\sum_{k \in \mathcal{S}_{t+1}} e^k_{t+1-E}\right] = \mathbb{E}\left[\bar{u}_{t+1}\right].$$
As for the variance, we have
$$\mathbb{E}\left\|\bar{u}_{t+1} - \bar{p}_{t+1}\right\|^2 = \frac{1}{K^2}\mathbb{E}\left\|\sum_{k \in \mathcal{S}_{t+1}} n^k_{t+1}\right\|^2 + \frac{1}{K^2}\mathbb{E}\left\|\sum_{k \in \mathcal{S}_{t+1}} e^k_{t+1-E}\right\|^2 = \frac{1}{K^2\nu}\sum_{k \in \mathcal{S}_{t+1}}\mathbb{E}\left\|d^k_{t+1}\right\|^2 + \frac{d\,\bar{\zeta}^2_{t+1-E}}{K}, \tag{44}$$
where the last equality comes from the constant uplink SNR control, Eqn. (9), the independence of the noise terms across clients, and the assumption that each client has the same downlink noise power $\bar{\zeta}^2_t$, $\forall k \in [N]$. We further have
$$\sum_{k \in \mathcal{S}_{t+1}} \mathbb{E}\left\|d^k_{t+1}\right\|^2 = \sum_{k \in \mathcal{S}_{t+1}} \mathbb{E}\left\|v^k_{t+1} - w_{t+1-E}\right\|^2 + dK\bar{\zeta}^2_{t+1-E} \le \sum_{k \in \mathcal{S}_{t+1}} \mathbb{E}\left\|\sum_{\tau=t+1-E}^{t} \eta_\tau \nabla F_k(w^k_\tau, \xi^k_\tau)\right\|^2 + dK\bar{\zeta}^2_{t+1-E} \le 4E^2 K \eta_t^2 H^2 + dK\bar{\zeta}^2_{t+1-E} \tag{45}$$
using the Cauchy-Schwarz inequality, Assumptions 1)-4), and $\eta_{t+1-E} \le 2\eta_{t+1} \le 2\eta_t$. Plugging Eqn. (45) back into Eqn. (44) gives
$$\mathbb{E}\left\|\bar{u}_{t+1} - \bar{p}_{t+1}\right\|^2 \le \left(1 + \frac{1}{\nu}\right)\frac{d\,\bar{\zeta}^2_{t+1-E}}{K} + \frac{4E^2}{K\nu}\eta_t^2 H^2,$$
which completes the proof.

We are now ready to present the proof of Theorem 3, which is similar to that of Theorem 2. In particular, the analysis of the four cases in Appendix B still holds, with the only change that Eqn. (43) is updated to Eqn. (46) below using Lemma 4:
$$\mathbb{E}\left\|\bar{p}_{t+1} - w^*\right\|^2 \le (1 - \eta_t\mu)\,\mathbb{E}\left\|\bar{p}_t - w^*\right\|^2 + (1 - \eta_t\mu)\frac{d\,\bar{\zeta}^2_t}{N} + \left(1 + \frac{1}{\nu}\right)\frac{d\,\bar{\zeta}^2_t}{K} + \eta_t^2\left[\sum_{k=1}^N \frac{\delta_k^2}{N} + 6L\Gamma + 8(E-1)^2 H^2 + \frac{N-K}{N-1}\frac{4}{K}E^2 H^2 + \frac{4E^2}{K\nu}H^2\right]. \tag{46}$$
We note that the constant uplink SNR control is already used in Lemma 4 and Eqn. (46). Then, by the definition of $\Delta_t = \mathbb{E}\|\bar{p}_t - w^*\|^2$ and controlling the downlink SNR such that
$$\bar{\zeta}^2_t \le \frac{NK\eta_t^2}{(1-\eta_t\mu)K + \left(1+\frac{1}{\nu}\right)N},$$
we have $\Delta_{t+1} \le (1-\eta_t\mu)\Delta_t + \eta_t^2 D$, where
$$D \triangleq \sum_{k=1}^N \frac{\delta_k^2}{N} + 6L\Gamma + 8(E-1)^2 H^2 + \frac{N-K}{N-1}\frac{4}{K}E^2 H^2 + \frac{4E^2}{K\nu}H^2 + d.$$
The remaining proof follows the same induction steps as in Appendices A and B.
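To make the mechanism behind Lemma 4 concrete, the following toy sketch simulates one model-differential aggregation round: each sampled client uploads its differential $d^k$ (the update relative to the noisy broadcast copy it holds), the uplink adds noise whose power is set by a constant SNR $\nu$, and the server applies the averaged noisy differential to its own clean copy of the global model. This is an illustrative reconstruction under stated assumptions, not the paper's experimental code; all shapes and constants are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K, d, nu = 50, 10, 8, 100.0   # clients, sampled, dimension, uplink SNR

w_global = rng.normal(size=d)            # server's model w_{t+1-E}
zeta2 = 1e-4                             # downlink noise power per entry
# each client holds a noisy copy of the broadcast model
w_local = w_global + rng.normal(scale=np.sqrt(zeta2), size=(N, d))

# local training over E steps is abstracted as a small arbitrary update
v = w_local + 0.01 * rng.normal(size=(N, d))

S = rng.choice(N, size=K, replace=False)  # partial participation
agg = np.zeros(d)
for k in S:
    diff = v[k] - w_local[k]              # model differential d^k
    # constant-SNR uplink: noise power scales with the differential's power
    noise_power = np.sum(diff ** 2) / (nu * d)
    noise = rng.normal(scale=np.sqrt(noise_power), size=d)
    agg += diff + noise
w_next = w_global + agg / K               # new global model w_{t+1}

print(np.linalg.norm(w_next - w_global))
```

The design point this illustrates is the reason a constant $\nu$ suffices for model differential transmission: as training converges, $\|d^k\|$ shrinks, so the absolute uplink noise power fades together with the signal and never dominates the SGD noise, whereas direct model transmission sends a non-vanishing signal and therefore needs growing SNR.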
REFERENCES

[1] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. AISTATS, Apr. 2017, pp. 1273-1282.
[2] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," in NIPS Workshop on Private Multi-Party Machine Learning, 2016.
[3] K. Bonawitz et al., "Towards federated learning at scale: System design," in Proc. SysML Conference, 2019, pp. 1-15.
[4] P. Kairouz et al., "Advances and open problems in federated learning," arXiv preprint arXiv:1912.04977, 2019.
[5] T. Li, A. K. Sahu, A. Talwalkar, and V. Smith, "Federated learning: Challenges, methods, and future directions," IEEE Signal Process. Mag., vol. 37, no. 3, pp. 50-60, 2020.
[6] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated optimization in heterogeneous networks," in Proceedings of the 3rd MLSys Conference, 2020.
[7] S. Zheng, C. Shen, and X. Chen, "Design and analysis of uplink and downlink communications for federated learning," IEEE J. Select. Areas Commun., 2020, to appear.
[8] A. Reisizadeh, A. Mokhtari, H. Hassani, A. Jadbabaie, and R. Pedarsani, "FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization," in Proc. AISTATS, 2020.
[9] Y. Du, S. Yang, and K. Huang, "High-dimensional stochastic gradient quantization for communication-efficient edge learning," IEEE Trans. Signal Processing, vol. 68, pp. 2128-2142, 2020.
[10] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y.-C. Liang, Q. Yang, D. Niyato, and C. Miao, "Federated learning in mobile edge networks: A comprehensive survey," IEEE Commun. Surveys Tuts., 2020.
[11] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, "Toward an intelligent edge: Wireless communication meets machine learning," IEEE Commun. Mag., vol. 58, no. 1, pp. 19-25, 2020.
[12] Q. Zeng, Y. Du, K. K. Leung, and K. Huang, "Energy-efficient radio resource allocation for federated edge learning," arXiv preprint arXiv:1907.06040, 2019.
[13] G. Zhu, Y. Wang, and K. Huang, "Broadband analog aggregation for low-latency federated edge learning," IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491-506, 2019.
[14] X. Cao, G. Zhu, J. Xu, and S. Cui, "Optimized power control for over-the-air federated edge learning," arXiv preprint arXiv:2011.05587, 2020.
[15] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," arXiv preprint arXiv:1909.07972, 2019.
[16] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in IEEE International Conference on Communications (ICC), 2019, pp. 1-7.
[17] W. Shi, S. Zhou, and Z. Niu, "Device scheduling with fast convergence for wireless federated learning," arXiv preprint arXiv:1911.00856, 2019.
[18] J. Xu and H. Wang, "Client selection and bandwidth allocation in wireless federated learning networks: A long-term perspective," IEEE Trans. Wireless Commun., pp. 1-1, 2020.
[19] M. M. Amiri, D. Gündüz, S. R. Kulkarni, and H. V. Poor, "Federated learning with quantized global model updates," arXiv preprint arXiv:2006.10672, 2020.
[20] S. U. Stich, "Local SGD converges fast and communicates little," in Proc. ICLR, 2018.
[21] J. Wang and G. Joshi, "Cooperative SGD: A unified framework for the design and analysis of communication-efficient SGD algorithms," in ICML Workshop on Coding Theory for Machine Learning, 2019.
[22] F. Haddadpour, M. M. Kamani, M. Mahdavi, and V. Cadambe, "Local SGD with periodic averaging: Tighter analysis and adaptive synchronization," in Advances in Neural Information Processing Systems, 2019, pp. 11080-11092.
[23] X. Li, K. Huang, W. Yang, S. Wang, and Z. Zhang, "On the convergence of FedAvg on non-IID data," in International Conference on Learning Representations, 2020.
[24] A. Bhowmick, J. Duchi, J. Freudiger, G. Kapoor, and R. Rogers, "Protection against reconstruction and its applications in private federated learning," arXiv preprint arXiv:1812.00984, 2018.
[25] C. Niu, F. Wu, S. Tang, L. Hua, R. Jia, C. Lv, Z. Wu, and G. Chen, "Secure federated submodel learning," arXiv preprint arXiv:1911.02254, 2019.
[26] C. Xie, S. Koyejo, and I. Gupta, "Practical distributed learning: Secure machine learning with communication-efficient local updates," arXiv preprint arXiv:1903.06996, 2019.
[27] T. Li, M. Sanjabi, and V. Smith, "Fair resource allocation in federated learning," arXiv preprint arXiv:1905.10497, 2019.
[28] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE J. Select. Areas Commun., vol. 37, no. 6, pp. 1205-1221, 2019.
[29] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Advances in Neural Information Processing Systems, 2017, pp. 1709-1720.
[30] G. Zhu, Y. Du, D. Gündüz, and K. Huang, "One-bit over-the-air aggregation for communication-efficient federated edge learning: Design and convergence analysis," arXiv preprint arXiv:2001.05713, 2020.
[31] M. M. Amiri and D. Gündüz, "Federated learning over wireless fading channels," IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546-3557, 2020.
[32] X. Mo and J. Xu, "Energy-efficient federated edge learning with joint communication and computation design," arXiv preprint arXiv:2003.00199, 2020.
[33] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, "Energy efficient federated learning over wireless communication networks," arXiv preprint arXiv:1911.02417, 2019.
[34] H. H. Yang, Z. Liu, T. Q. Quek, and H. V. Poor, "Scheduling policies for federated learning in wireless networks," IEEE Trans. Commun., vol. 68, no. 1, pp. 317-333, 2020.
[35] M. Chen, H. V. Poor, W. Saad, and S. Cui, "Convergence time optimization for federated learning over wireless networks," arXiv preprint arXiv:2001.07845, 2020.
[36] K. Yang, T. Jiang, Y. Shi, and Z. Ding, "Federated learning via over-the-air computation," IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022-2035, 2020.
[37] Y. Sun, S. Zhou, and D. Gündüz, "Energy-aware analog aggregation for federated learning with redundant data," in IEEE International Conference on Communications (ICC), 2020, pp. 1-7.
[38] ——, "Energy-aware analog aggregation for federated learning with redundant data," in IEEE International Conference on Communications (ICC), 2020, pp. 1-7.
[39] S. Caldas, J. Konečný, H. B. McMahan, and A. Talwalkar, "Expanding the reach of federated learning by reducing client resource requirements," arXiv preprint arXiv:1812.07210, 2018.
[40] F. Ang, L. Chen, N. Zhao, Y. Chen, W. Wang, and F. R. Yu, "Robust federated learning with noisy communication," IEEE Trans. Commun., vol. 68, no. 6, pp. 3452-3464, 2020.
[41] P. Jiang and G. Agrawal, "A linear speedup analysis of distributed deep learning with sparse and quantized communication," in Advances in Neural Information Processing Systems, 2018, pp. 2525-2536.
[42] F. Zhou and G. Cong, "On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization," in Proc. IJCAI, 2018, pp. 3219-3227.
[43] H. Yu, R. Jin, and S. Yang, "On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization," arXiv preprint arXiv:1905.03817, 2019.
[44] S. Caldas et al., "LEAF: A benchmark for federated settings," arXiv preprint arXiv:1812.01097, 2018.