Federated Learning over Wireless Device-to-Device Networks: Algorithms and Convergence Analysis
Hong Xing, Osvaldo Simeone, and Suzhi Bi
Abstract
The proliferation of Internet-of-Things (IoT) devices and cloud-computing applications over siloed data centers is motivating renewed interest in the collaborative training of a shared model by multiple individual clients via federated learning (FL). To improve the communication efficiency of FL implementations in wireless systems, recent works have proposed compression and dimension-reduction mechanisms, along with digital and analog transmission schemes that account for channel noise, fading, and interference. This prior art has mainly focused on star topologies consisting of distributed clients and a central server. In contrast, this paper studies FL over wireless device-to-device (D2D) networks by providing theoretical insights into the performance of digital and analog implementations of decentralized stochastic gradient descent (DSGD). First, we introduce generic digital and analog wireless implementations of communication-efficient DSGD algorithms, leveraging random linear coding (RLC) for compression and over-the-air computation (AirComp) for simultaneous analog transmissions. Next, under the assumptions of convexity and connectivity, we provide convergence bounds for both implementations. The results demonstrate the dependence of the optimality gap on the connectivity and on the signal-to-noise ratio (SNR) levels in the network. The analysis is corroborated by experiments on an image-classification task.
Index Terms
Federated learning, distributed learning, decentralized stochastic gradient descent, over-the-air com-putation, D2D networks.
Part of this paper has been presented at the IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), May 2020 [1]. H. Xing and S. Bi are with the College of Electronic and Information Engineering, Shenzhen University, Shenzhen, 518060, China (e-mails: hong.xing, [email protected]). O. Simeone is with the King's Communications, Learning, and Information Processing (KCLIP) lab, Department of Engineering, King's College London, London, WC2R 2LS, U.K. (e-mail: [email protected]). The work of O. Simeone is supported by the European Research Council under the European Union's Horizon 2020 research and innovation program under grant 725731.
I. INTRODUCTION
With the proliferation of Internet-of-Things (IoT) devices and cloud-computing applications over siloed data centres, distributed learning has become a critical enabler for artificial intelligence (AI) solutions [2], [3]. In distributed learning, multiple agents collaboratively train a machine learning model via the exchange of training data, model parameters, and/or gradient vectors over geographically distributed computing resources and data.
Federated learning (FL) refers to distributed learning protocols that do not directly exchange the training data, in an attempt to reduce the communication load and to limit privacy concerns [4]–[7]. In conventional FL, multiple clients train a shared model by exchanging model-related parameters with a central node. This class of protocols hence relies on a parameter-server architecture, which is typically realized in wireless settings via a base station-centric network topology [8]–[13].

There are important scenarios in which a base station-centric architecture is either unavailable or undesirable due to coverage, privacy, implementation efficiency, or fault-tolerance considerations [14], [15]. In such cases, distributed learning must rely on a peer-to-peer, edge-based communication topology that encompasses device-to-device (D2D) links among individual learning agents over an arbitrary connectivity graph. With the exception of [1], [16], all prior works on decentralized FL nevertheless assume either ideal or rate-limited but noiseless D2D communications. This paper is the first to offer a rigorous convergence analysis of digital and analog implementations of wireless D2D FL, with the aim of providing insights into the effect of wireless impairments caused by link blockages, pathloss, channel fading, and interference.
A. Related Work
The problem of alleviating the communication load in FL systems has been widely investigated, mostly under the assumption of noiseless, rate-limited links and star topologies. Key elements of these solutions are compression and dimension-reduction operations that map the original model parameters or gradient vectors into representations defined by a limited number of bits and/or by sparsity. Important classes of solutions include unbiased compressors (e.g., [8]–[10]) and biased compressors with error-feedback mechanisms (e.g., [11]–[13]).

In a D2D architecture, even in the presence of noiseless communication, devices can only exchange information with their respective neighbors, making consensus mechanisms essential to ensure agreement towards the common learning goal [17], [18]. A well-known protocol integrating stochastic gradient descent (SGD) and consensus is Decentralized Stochastic Gradient Descent (DSGD), which has been further extended and improved via gradient tracking [19], as well as via variance-reduction schemes for large data heterogeneity among agents [20]. As for FL in star topologies, the communication overhead in decentralized learning can be reduced via compression, as demonstrated by the CHOCO-SGD algorithm [21], [22]. This protocol, combining DSGD with biased compression, was studied for strongly convex and smooth objectives in [21] and for non-convex smooth objectives in [22]. It was further combined with event-triggered protocols in [23].

A large number of recent works have proposed communication strategies and multi-access protocols for FL in wireless star topologies [24]–[28]. At the physical layer, over-the-air computation (AirComp) was investigated in [13], [25], [29]–[32] as a promising solution to support simultaneous transmissions by leveraging the waveform superposition property of the wireless medium. Unlike conventional digital communication over orthogonal transmission blocks, AirComp is based on analog, i.e., uncoded, transmission, which enables the estimation of aggregated statistics directly from the received baseband samples. This reduces the communication burden by relieving the network of the need to decode information separately for each participating device. The impact of AirComp on the performance of FL was studied in [31], [32]. The authors in [31] proposed an adaptive learning-rate scheduler and investigated the convergence of the resulting protocol. Reference [32] derived sufficient conditions in terms of signal-to-noise ratio (SNR) for the FL algorithm to attain the same convergence rate as in the ideal noiseless case.

The literature on decentralized FL in wireless D2D architectures is, in contrast, still quite limited. A DSGD-based algorithm termed MATCHA was proposed in [33] to account for interference among nearby links. Adopting a general interference graph, the scheme is based on sampling a matching decomposition of the graph, whereby connectivity-critical links are activated with a higher probability. By relying on a matching decomposition, MATCHA schedules noiseless non-interfering communication links in parallel. No attempt was made to leverage non-orthogonal physical-layer protocols such as AirComp. In contrast, in the conference version [1] of this paper, wireless protocols for both digital and analog implementations of error-compensated DSGD were designed by including AirComp. However, no theoretical analysis was offered on the convergence of the considered wireless decentralized FL algorithms.
B. Main Contributions
In this paper, we provide for the first time a rigorous analysis of digital and analog transmission protocols tailored to FL over wireless D2D networks in terms of their convergence properties. The contributions of this paper are summarized as follows.

1) We introduce generic digital and analog wireless implementations of DSGD algorithms. The protocols rely on compression via random linear coding (RLC) as applied to model differential information. The general protocols enable broadcasting for digital transmission, and both broadcasting and AirComp for analog transmission.

2) Under the assumptions of convexity and connectivity, we derive convergence bounds for the generic digital wireless implementation. The result demonstrates the dependence of the optimality gap on the connectivity of the graph and on the model differential estimation error due to compression.

3) We also provide convergence bounds for the analog wireless implementation. The analysis reveals the impact of topology and channel noise, as well as the importance of implementing an adaptive consensus step size. To the best of our knowledge, this is the first time that an adaptive consensus step size is shown to be beneficial for convergence.

4) We provide numerical experiments for image classification, confirming the benefits of the proposed adaptive consensus rate and demonstrating the agreement between analytical and empirical results.

The remainder of this paper is organized as follows. The system model is presented in Section II. Digital and analog transmission protocols are introduced in Section III. The convergence analyses for the two implementations are presented in Section IV and Section V, respectively. Numerical performance results are described in Section VI, followed by conclusions in Section VII.
C. Notations
We use upper-case boldface letters for matrices and lower-case boldface letters for vectors. We use $\|\cdot\|$ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. The average of vectors $\mathbf{x}_i$ over $i \in \mathcal{V}$ is defined as $\bar{\mathbf{x}} = \frac{1}{K}\sum_{i\in\mathcal{V}} \mathbf{x}_i$. Notations $\mathrm{Tr}(\cdot)$ and $(\cdot)^T$ denote the trace and the transpose of a matrix, respectively; $\mathbb{E}[\cdot]$ stands for the statistical expectation of a random variable; $\mathbf{I}$ represents an identity matrix of appropriate size; and $\triangleq$ indicates a mathematical definition. $\lambda_i(\cdot)$ denotes the $i$th largest eigenvalue of a matrix.

II. SYSTEM MODEL
In this paper, we consider an FL problem in a decentralized setting, as shown in Fig. 1, in which a set $\mathcal{V} = \{1, \ldots, K\}$ of $K$ devices can only communicate with their respective neighbors over a wireless D2D network whose connectivity is characterized by an undirected graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, with $\mathcal{V}$ denoting the set of nodes and $\mathcal{E} \subseteq \{(i, j) \in \mathcal{V} \times \mathcal{V} \,|\, i \neq j\}$ the set of edges. The set of neighbors of node $i$ is denoted as $\mathcal{N}_i = \{j \in \mathcal{V} \,|\, (i, j) \in \mathcal{E}\}$. Following the FL framework, each device has available a local data set, and all devices collaboratively train a machine learning model by exchanging model-related information without directly disclosing data samples to one another.

Fig. 1. The connectivity graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ for a wireless D2D network.

A. Learning Model
Each device $i \in \mathcal{V}$ has access to its local data set $\mathcal{D}_i$, which may have a non-empty intersection with the data set $\mathcal{D}_j$ of any other device $j \in \mathcal{V}$, $j \neq i$. All devices share a common machine learning model class, which is parametrized by a vector $\boldsymbol{\theta} \in \mathbb{R}^{d\times 1}$. As a typical example, the model class may consist of a neural network with a given architecture. The goal of the network is to solve the empirical risk minimization problem [19], [22]
$$\text{(P0)}: \quad \underset{\boldsymbol{\theta}}{\text{Minimize}} \;\; F(\boldsymbol{\theta}) \triangleq \frac{1}{K}\sum_{i\in\mathcal{V}} f_i(\boldsymbol{\theta}),$$
where $f_i(\boldsymbol{\theta}) = \frac{1}{|\mathcal{D}_i|}\sum_{\xi\in\mathcal{D}_i} l(\boldsymbol{\theta}; \xi)$ is the local empirical risk function for device $i$, with $l(\boldsymbol{\theta}; \xi)$ denoting the loss accruing from parameter $\boldsymbol{\theta}$ on data sample $\xi \in \mathcal{D}_i$, which may include the effect of regularization.

To enable decentralized learning, we adopt CHOCO-SGD, a communication-efficient variant of DSGD [21]. At the start of each iteration $t+1$, device $i \in \mathcal{V}$ has in its memory its current model iterate $\boldsymbol{\theta}_i^{(t)}$, the corresponding estimated version $\hat{\boldsymbol{\theta}}_i^{(t)}$, and the estimated iterates $\hat{\boldsymbol{\theta}}_j^{(t)}$ for all its neighbors $j \in \mathcal{N}_i$. We note that an equivalent version of the algorithm that requires less memory can be found in [21, Algorithm 6], but we do not consider it here since it does not change the communication requirements. At each iteration $t$, device $i \in \mathcal{V}$ first executes a local update step by SGD based on its data set $\mathcal{D}_i$ as
$$\boldsymbol{\theta}_i^{(t+1/2)} = \boldsymbol{\theta}_i^{(t)} - \eta^{(t)} \hat{\nabla} f_i(\boldsymbol{\theta}_i^{(t)}), \quad (1)$$
where $\eta^{(t)}$ denotes the learning rate, and $\hat{\nabla} f_i(\boldsymbol{\theta}_i^{(t)})$ is an estimate of the exact gradient $\nabla f_i(\boldsymbol{\theta}_i^{(t)})$ obtained from a mini-batch $\mathcal{D}_i^{(t)} \subseteq \mathcal{D}_i$ of the data set, i.e., $\hat{\nabla} f_i(\boldsymbol{\theta}_i^{(t)}) = \frac{1}{|\mathcal{D}_i^{(t)}|}\sum_{\xi\in\mathcal{D}_i^{(t)}} \nabla l(\boldsymbol{\theta}_i^{(t)}; \xi)$. Then, each device $i \in \mathcal{V}$ compresses the difference $\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}$ between the locally updated model (1) and the previously estimated iterate $\hat{\boldsymbol{\theta}}_i^{(t)}$. The compressed difference $C^{(t)}(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)})$ is then exchanged with the neighbors of node $i$.
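The local SGD step (1) and the exchange of a compressed model differential can be sketched numerically as follows. This is a minimal illustration, not the exact protocol: the dimensions, the learning rate, and the random $\pm 1$ projection standing in for the RLC compressor of Section II-C are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 16  # model dimension and compressed dimension (illustrative)

def local_sgd_step(theta, grad_est, eta):
    """Local update (1): theta^{(t+1/2)} = theta^{(t)} - eta * gradient estimate."""
    return theta - eta * grad_est

def compress(u, A):
    """Linear compression C(u) = A u of the model differential."""
    return A @ u

def decompress(v, A):
    """Decoding D(v) = (m/d) A^T v, mirroring the RLC decoder of Section II-C."""
    return (m / d) * (A.T @ v)

# Stand-in random +-1 projection (the paper uses a partial-Hadamard RLC matrix).
A = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)

theta = rng.standard_normal(d)        # current iterate theta_i^{(t)}
theta_hat = np.zeros(d)               # previously shared estimate theta_hat_i^{(t)}
grad_est = rng.standard_normal(d)     # mini-batch gradient estimate (illustrative)
theta_half = local_sgd_step(theta, grad_est, eta=0.1)

# Only the compressed differential is exchanged with neighbors; both end points
# apply the same decoding, so the shared estimates stay synchronized.
msg = compress(theta_half - theta_hat, A)
theta_hat_new = theta_hat + decompress(msg, A)
```

Since sender and receivers decode the same message with the same shared matrix, every node holds an identical copy of the estimate, which the subsequent consensus correction then operates on.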
Assuming for now that communication is reliable (an assumption that we will revisit in the rest of the paper), each device $i \in \mathcal{V}$ updates the estimated model parameters $\hat{\boldsymbol{\theta}}_j^{(t)}$ for itself and for its neighbors as
$$\hat{\boldsymbol{\theta}}_j^{(t+1)} = \hat{\boldsymbol{\theta}}_j^{(t)} + D^{(t)}\big(C^{(t)}(\boldsymbol{\theta}_j^{(t+1/2)} - \hat{\boldsymbol{\theta}}_j^{(t)})\big), \quad j \in \{i\} \cup \mathcal{N}_i, \quad (2)$$
where $D^{(t)}(\cdot)$ is a decoding function. Next, device $i \in \mathcal{V}$ executes a consensus update step by correcting the updated model (1) using the estimated parameters (2) as
$$\boldsymbol{\theta}_i^{(t+1)} = \boldsymbol{\theta}_i^{(t+1/2)} + \zeta^{(t)} \sum_{j\in\mathcal{N}_i\cup\{i\}} w_{ij}\big(\hat{\boldsymbol{\theta}}_j^{(t+1)} - \hat{\boldsymbol{\theta}}_i^{(t+1)}\big), \quad (3)$$
where $\zeta^{(t)}$ is the consensus rate, and the mixing matrix $\mathbf{W} = \mathbf{W}^T \in \mathbb{R}^{K\times K}$ is selected to be doubly stochastic, i.e., $[\mathbf{W}]_{ij} = w_{ij} \geq 0$, $\mathbf{W}\mathbf{1} = \mathbf{1}$, $\mathbf{1}^T\mathbf{W} = \mathbf{1}^T$, and $\|\mathbf{W} - \mathbf{1}\mathbf{1}^T/K\| < 1$. A typical choice is to set $w_{ij} = \alpha$ for all $j \in \mathcal{N}_i$, $w_{ii} = 1 - |\mathcal{N}_i|\alpha$, and $w_{ij} = 0$ otherwise, where the constant $\alpha > 0$ is a design parameter. We postpone the discussion of the compression operator $C^{(t)}(\cdot)$ and the decoding operator $D^{(t)}(\cdot)$ to Section II-C. The considered decentralized learning protocol is summarized in Algorithm 1. Finally, we make the following assumptions, which are widely adopted in the literature on decentralized stochastic optimization [21].

Algorithm 1:
Decentralized Learning with Noiseless Communication

Input: consensus step size $\zeta^{(t)}$, SGD learning step size $\eta^{(t)}$, connectivity graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, and mixing matrix $\mathbf{W}$.
Initialize at each node $i \in \mathcal{V}$: $\boldsymbol{\theta}_i^{(0)}$, $\hat{\boldsymbol{\theta}}_j^{(0)} = \mathbf{0}$, $\forall j \in \mathcal{N}_i \cup \{i\}$;
for $t = 0, 1, \ldots, T-1$ do
  for each device $i \in \mathcal{V}$ do in parallel
    update $\boldsymbol{\theta}_i^{(t+1/2)} = \boldsymbol{\theta}_i^{(t)} - \eta^{(t)}\hat{\nabla} f_i(\boldsymbol{\theta}_i^{(t)})$;
    compress the difference $\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}$ to obtain $C^{(t)}(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)})$;
    for each neighboring device $j \in \mathcal{N}_i$ do in parallel
      send $C^{(t)}(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)})$;
      receive $C^{(t)}(\boldsymbol{\theta}_j^{(t+1/2)} - \hat{\boldsymbol{\theta}}_j^{(t)})$;
    end
    update $\hat{\boldsymbol{\theta}}_j^{(t+1)} = \hat{\boldsymbol{\theta}}_j^{(t)} + D^{(t)}\big(C^{(t)}(\boldsymbol{\theta}_j^{(t+1/2)} - \hat{\boldsymbol{\theta}}_j^{(t)})\big)$, for $j \in \{i\} \cup \mathcal{N}_i$;
    update $\boldsymbol{\theta}_i^{(t+1)} = \boldsymbol{\theta}_i^{(t+1/2)} + \zeta^{(t)}\sum_{j\in\mathcal{N}_i\cup\{i\}} w_{ij}\big(\hat{\boldsymbol{\theta}}_j^{(t+1)} - \hat{\boldsymbol{\theta}}_i^{(t+1)}\big)$.
  end
end
Output: $\boldsymbol{\theta}_i^{(T-1)}$, $\forall i \in \mathcal{V}$

Assumption 2.1:
Each local empirical risk function $f_i(\boldsymbol{\theta})$, $i \in \mathcal{V}$, is $L$-smooth and $\mu$-strongly convex; that is, for all $\boldsymbol{\theta} \in \mathbb{R}^{d\times 1}$ and $\boldsymbol{\theta}' \in \mathbb{R}^{d\times 1}$, it satisfies the inequalities
$$f_i(\boldsymbol{\theta}) \leq f_i(\boldsymbol{\theta}') + \nabla f_i(\boldsymbol{\theta}')^T(\boldsymbol{\theta} - \boldsymbol{\theta}') + \frac{L}{2}\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|^2, \quad (4)$$
and
$$f_i(\boldsymbol{\theta}) \geq f_i(\boldsymbol{\theta}') + \nabla f_i(\boldsymbol{\theta}')^T(\boldsymbol{\theta} - \boldsymbol{\theta}') + \frac{\mu}{2}\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|^2. \quad (5)$$

Assumption 2.2:
The variance of the mini-batch gradient $\hat{\nabla} f_i(\boldsymbol{\theta}_i)$ is bounded as
$$\mathbb{E}_{\mathcal{D}_i^{(t)}}\big[\|\hat{\nabla} f_i(\boldsymbol{\theta}_i) - \nabla f_i(\boldsymbol{\theta}_i)\|^2\big] \leq \sigma_i^2, \quad (6)$$
and its expected squared Euclidean norm is bounded as
$$\mathbb{E}_{\mathcal{D}_i^{(t)}}\big[\|\hat{\nabla} f_i(\boldsymbol{\theta}_i)\|^2\big] \leq G^2, \quad (7)$$
where the expectation $\mathbb{E}_{\mathcal{D}_i^{(t)}}[\cdot]$ is taken over the selection of a mini-batch $\mathcal{D}_i^{(t)} \subseteq \mathcal{D}_i$.

B. Communication Model
As seen in Fig. 2, at the end of every iteration $t$, communication takes place within one communication block of a total number $N$ of channel uses, spanning $M$ equal-length slots, denoted by $\mathcal{S} = \{1, \ldots, M\}$. Slow fading remains constant across all iterations and is binary, determining whether a link is blocked or not. A link $(i, j) \in \mathcal{E}$ is by definition not blocked, while all the other links $(i, j) \notin \mathcal{E}$ are blocked. We assume that the connectivity graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ with all the unblocked links as edges satisfies the following assumption.

Assumption 2.3:
Graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ is a connected graph.

For all unblocked links $(i, j) \in \mathcal{E}$, the channel coefficient between devices $i$ and $j$ is modelled as
$$h_{ij}'^{(t)} \triangleq \sqrt{A\Big(\frac{d_{ij}}{d_0}\Big)^{-\gamma}}\, h_{ij}^{(t)}, \quad (8)$$
where the fast fading coefficient $h_{ij}^{(t)} \sim \mathcal{CN}(0, 1)$ remains unchanged within one communication block and varies independently across blocks, and the path loss gain $\sqrt{A}(d_0/d_{ij})^{\gamma/2}$ is constant across all iterations. Here, $A$ is the average channel power gain at the reference distance $d_0$; $d_{ij}$ is the distance between devices $i$ and $j$; and $\gamma$ is the path loss exponent.

Fig. 2. Timeline of training iterations and communication blocks: A communication block of $N$ channel uses, divided into $M$ slots, is employed to exchange the compressed difference between model parameters among neighboring devices.

Each device is subject to an energy constraint of $N P^{(t)}$ per communication block. If a device is active for $M' \leq M$ slots, the energy per symbol is hence given by $\frac{N P^{(t)}}{M' N/M} = P^{(t)}\frac{M}{M'}$. The mean-square power of the additive white Gaussian noise (AWGN) is denoted as $N_0$.

C. Compression
In this subsection, we describe the assumed compression operator $C(\cdot)$ and decompression operator $D(\cdot)$ used in update (3). We specifically adopt random linear coding (RLC) compression [10]. Let $\mathbf{A}^{(t)} = \sqrt{1/m}\,\mathbf{H}\mathbf{R}^{(t)}$ be the linear encoding matrix, where $\mathbf{H} \in \{\pm 1\}^{m\times d}$ with $m \leq d$ is a partial Hadamard matrix with mutually orthogonal rows, i.e., $\frac{1}{d}\mathbf{H}\mathbf{H}^T = \mathbf{I}$; and $\mathbf{R}^{(t)} \in \mathbb{R}^{d\times d}$ is a diagonal matrix with its diagonal entries $[\mathbf{R}^{(t)}]_{ii} \triangleq r_i^{(t)}$ drawn from uniform distributions such that $\Pr(r_i^{(t)} = 1) = \Pr(r_i^{(t)} = -1) = 0.5$, for all $i = 1, \ldots, d$. The compression operator is given by the linear projection $C^{(t)}(\mathbf{u}) = \mathbf{A}^{(t)}\mathbf{u}$, while decoding takes place as $D^{(t)}(\mathbf{v}) = \frac{m}{d}(\mathbf{A}^{(t)})^T\mathbf{v}$. The concatenation of the compression and decompression operators, namely $D^{(t)}(C^{(t)}(\mathbf{u})) = \frac{m}{d}(\mathbf{A}^{(t)})^T\mathbf{A}^{(t)}\mathbf{u}$, satisfies the compression property [10], [21]
$$\mathbb{E}\Big\|\mathbf{u} - \frac{m}{d}(\mathbf{A}^{(t)})^T\mathbf{A}^{(t)}\mathbf{u}\Big\|^2 = \Big(1 - \frac{m}{d}\Big)\|\mathbf{u}\|^2, \quad \text{for all } \mathbf{u} \in \mathbb{R}^{d\times 1}. \quad (9)$$
We note that the random matrices $\{\mathbf{R}^{(t)}\}$ need to be shared among devices prior to the start of the communication protocol so that the same random sequence $\{\mathbf{A}^{(t)}\}$ is agreed upon by all devices.

III. DIGITAL AND ANALOG TRANSMISSION PROTOCOLS
In this section, we describe digital and analog wireless implementations of the decentralized learning algorithm reviewed in the previous section. The implementations are meant to serve as prototypical templates for the deployment of decentralized learning. In practice, specific protocols are needed to specify the scheduling strategy used to allocate the slots in Fig. 2 to devices. The analysis in this paper, detailed in Section IV and Section V, applies to any scheduling algorithm.
A. Digital Transmission
With the digital transmission protocol, devices represent their model updates as digital messages for transmission. The number $B_i^{(t)}$ of bits that device $i \in \mathcal{V}$ can successfully broadcast to its neighbors during a slot allocated to it by the scheduling algorithm is limited by the neighboring device with the worst channel power gain. Accordingly, we have
$$B_i^{(t)} = \frac{N}{M}\log_2\Big(1 + \frac{P^{(t)} M}{N_0}\min_{j\in\mathcal{N}_i}|h_{ij}'^{(t)}|^2\Big). \quad (10)$$
We recall that, in (10), the number $M$ of time slots per iteration is determined by the scheduling scheme. To quantize the encoded vector $\mathbf{A}_i^{(t)}(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}) \in \mathbb{R}^{m\times 1}$ into $B_i^{(t)}$ bits, we employ a simple per-element $b$-bit quantizer $Q_b(\cdot)$ with chip-level precision, so that $b = 64$ or $b = 32$ for double-precision or single-precision floating point, respectively, according to the IEEE standard. Communication constraints thus impose the inequality $b\, m_i^{(t)} \leq B_i^{(t)}$, which is satisfied by setting $m_i^{(t)} = \lfloor B_i^{(t)}/b \rfloor$. Based on the received quantized signal, each device $i \in \mathcal{V}$ updates the estimated model parameters of its own as well as of its neighbors $j \in \{i\} \cup \mathcal{N}_i$ as (cf. (2))
$$\hat{\boldsymbol{\theta}}_j^{(t+1)} = \hat{\boldsymbol{\theta}}_j^{(t)} + \frac{m_j^{(t)}}{d}(\mathbf{A}_j^{(t)})^T Q_b\big(\mathbf{A}_j^{(t)}(\boldsymbol{\theta}_j^{(t+1/2)} - \hat{\boldsymbol{\theta}}_j^{(t)})\big). \quad (11)$$
In order to implement update (11), each node $j \in \mathcal{V}$ and its neighbors in the set $\mathcal{N}_j$ can share a priori a common sequence of (pseudo-)random matrices $\tilde{\mathbf{A}}_j^{(t)} \in \mathbb{R}^{m\times d}$ satisfying the assumptions described in Section II-C. If node $j$ sends its current value $m_j^{(t)}$ to all neighbors and if $m_j^{(t)} \leq m$, all nodes can thus select the same submatrix $\mathbf{A}_j^{(t)}$ from $\tilde{\mathbf{A}}_j^{(t)}$ to evaluate (11). The described digital implementation is summarized in Algorithm 2 in the appendices.

B. Analog Transmission
With analog transmission, devices directly transmit their respective updated parameters by mapping analog signals to channel uses, without the need for digitization. As studied in [25], [30], in addition to broadcasting as in digital transmission, it is also useful to schedule all devices that share a common neighbor for simultaneous transmission in order to enable AirComp. The considered class of protocols can accommodate any scheduling scheme that, as in [1], operates over pairs of consecutive time slots in order to leverage AirComp. In the first slot of each pair, one or more center nodes receive a superposition of the signals simultaneously transmitted by all their respective neighbors. In the second slot, the center nodes serve as broadcast transmitters communicating to all their neighbors. The total number $M$ of time slots is thus given by twice the number $n$ of pairs of time slots, which is specified by the scheduling policy in use.

To elaborate on the operation of the analog transmission protocol, we will use the following notation. For each device $i \in \mathcal{V}$, we define a set $\mathcal{S}_i^{\rm Tx} \subseteq \mathcal{S}$ of transmission slots, with $\mathcal{S}_i^{\rm Tx} = \mathcal{S}_i^{\rm BT} \cup \mathcal{S}_i^{\rm AT}$ partitioned into disjoint subsets $\mathcal{S}_i^{\rm BT}$ and $\mathcal{S}_i^{\rm AT}$ ($\mathcal{S}_i^{\rm BT} \cap \mathcal{S}_i^{\rm AT} = \emptyset$). Subset $\mathcal{S}_i^{\rm BT}$ denotes the set of transmission slots in which device $i$ broadcasts to its neighbors, and $\mathcal{S}_i^{\rm AT}$ denotes the set of slots in which device $i$ transmits to enable AirComp. Similarly, we define the set $\mathcal{S}_i^{\rm Rx} \subseteq \mathcal{S}$ of receiving slots for device $i$ as $\mathcal{S}_i^{\rm Rx} = \mathcal{S}_i^{\rm BR} \cup \mathcal{S}_i^{\rm AR}$ with $\mathcal{S}_i^{\rm BR} \cap \mathcal{S}_i^{\rm AR} = \emptyset$, where $\mathcal{S}_i^{\rm BR}$ and $\mathcal{S}_i^{\rm AR}$ denote the sets of receiving slots in which device $i \in \mathcal{V}$ receives from a transmitter in broadcast and AirComp modes, respectively. Fig. 3 serves as an example illustrating the above definitions, listing the resulting non-empty sets $\mathcal{S}_i^{\rm AT}$, $\mathcal{S}_i^{\rm BT}$, $\mathcal{S}_i^{\rm AR}$, and $\mathcal{S}_i^{\rm BR}$ for each node.

Fig. 3. An example illustrating pairs of consecutive time slots for analog transmissions, given a scheduling scheme that yields $M = 6$ for the connectivity graph shown in Fig. 1.

We now describe the transmitted and the received signals in each pair of slots of the communication protocol.
Odd slots: All devices $i \in \mathcal{V}$ operating in AirComp mode for a center node $j$ in an odd slot $s \in \mathcal{S}_i^{\rm AT} \cap \mathcal{S}_j^{\rm AR}$ concurrently transmit the signals
$$\mathbf{x}_{ij}^{(t,s)} = \frac{\sqrt{\gamma_j^{(t,s)}}}{h_{ij}'^{(t)}}\, w_{ji}\, \mathbf{A}^{(t)}\big(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}\big), \quad (12)$$
where $\gamma_j^{(t,s)}$ is a power scaling factor for channel alignment at device $j$. The receiving center node, device $j$, obtains
$$\mathbf{y}_j^{(t,s)} = \sqrt{\gamma_j^{(t,s)}}\sum_{i\in\mathcal{N}_j^{(s)}} w_{ji}\, \mathbf{A}^{(t)}\big(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}\big) + \mathbf{n}_j^{(t,s)}, \quad (13)$$
where $\mathcal{N}_j^{(s)}$ is the set of neighbors of device $j$ operating in AirComp mode at slot $s$, and $\mathbf{n}_j^{(t,s)} \sim \mathcal{CN}(\mathbf{0}, N_0\mathbf{I})$ is the received AWGN at slot $s$ of iteration $t$. Device $j$ estimates the combined model parameters $\sum_{i\in\mathcal{N}_j^{(s)}} w_{ji}(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)})$ via the linear estimator
$$\hat{\mathbf{y}}_j^{(t,s)} = \frac{m}{d}(\mathbf{A}^{(t)})^T \Re\Big\{\mathbf{y}_j^{(t,s)}\big/\sqrt{\gamma_j^{(t,s)}}\Big\}. \quad (14)$$

Even slots: Any device $i \in \mathcal{V}$ operating in broadcast mode in an even slot $s \in \mathcal{S}_i^{\rm BT}$ transmits the signal
$$\mathbf{x}_i^{(t,s)} = \sqrt{\alpha_i^{(t,s)}}\, \mathbf{A}^{(t)}\big(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}\big), \quad (15)$$
where $\alpha_i^{(t,s)}$ is device $i$'s transmit power scaling factor in slot $s \in \mathcal{S}_i^{\rm BT}$ of iteration $t$. Each neighboring device $j \in \mathcal{N}_i^{(s)}$, with $s \in \mathcal{S}_j^{\rm BR}$, receives from device $i$ the signal
$$\mathbf{y}_{ij}^{(t,s)} = \sqrt{\alpha_i^{(t,s)}}\, h_{ij}'^{(t)}\, \mathbf{A}^{(t)}\big(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}\big) + \mathbf{n}_j^{(t,s)}, \quad (16)$$
where $\mathbf{n}_j^{(t,s)} \sim \mathcal{CN}(\mathbf{0}, N_0\mathbf{I})$ is the received AWGN. Device $j$ estimates the signal $\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}$ via the linear estimator
$$\hat{\mathbf{y}}_{ij}^{(t,s)} = w_{ji}\,\frac{m}{d}(\mathbf{A}^{(t)})^T \Re\Bigg\{\frac{\mathbf{y}_{ij}^{(t,s)}}{\sqrt{\alpha_i^{(t,s)}}\, h_{ij}'^{(t)}}\Bigg\}, \quad (17)$$
where $\Re\{\cdot\}$ denotes the real part of its argument. Next, device $j \in \mathcal{V}$ updates its estimate of the combined model parameters of all neighboring devices in $\mathcal{N}_j$ by aggregating the estimates obtained in all receiving slots in the set $\mathcal{S}_j^{\rm Rx} = \mathcal{S}_j^{\rm BR} \cup \mathcal{S}_j^{\rm AR}$ as
$$\hat{\mathbf{y}}_j^{(t+1)} = \hat{\mathbf{y}}_j^{(t)} + \sum_{s\in\mathcal{S}_j^{\rm AR}} \hat{\mathbf{y}}_j^{(t,s)} + \sum_{s\in\mathcal{S}_j^{\rm BR}} \hat{\mathbf{y}}_{i_s j}^{(t,s)}, \quad (18)$$
where node $i_s \in \mathcal{N}_j$ is the node that transmits in broadcast mode in slot $s \in \mathcal{S}_j^{\rm BR}$.
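The AirComp superposition and the linear estimator used in the odd slots can be illustrated with a small simulation. This is a hedged sketch under simplifying assumptions: the channels are pre-inverted at the transmitters so that the received signal is already aligned as in (13), all quantities are real-valued, and the projection matrix, mixing weights, and noise power are illustrative stand-ins rather than the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 32       # model dimension and projected dimension (illustrative)
K_nb = 3            # neighbors transmitting to the center node (assumed)
N0 = 1e-3           # noise power (assumed)
gamma = 10.0        # power-scaling factor gamma_j^{(t,s)} (assumed)

# Shared random +-1 projection standing in for the RLC matrix A^{(t)}.
A = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)

w = np.full(K_nb, 0.2)                   # mixing weights w_{ji} (assumed)
U = rng.standard_normal((K_nb, d))       # model differentials of the neighbors

# With channel inversion at the transmitters, the center node receives the
# aligned superposition of the projected differentials plus noise, as in (13).
y = np.sqrt(gamma) * sum(w[i] * (A @ U[i]) for i in range(K_nb))
y = y + np.sqrt(N0 / 2) * rng.standard_normal(m)   # real part of the AWGN

# Linear estimator (14) of the weighted sum of the differentials.
target = sum(w[i] * U[i] for i in range(K_nb))
estimate = (m / d) * (A.T @ (y / np.sqrt(gamma)))
```

Because the superposition happens over the air, the center node never decodes the individual differentials: a single received vector of $m$ samples yields the aggregated statistic directly, up to the compression distortion of (9) and the channel noise.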
The initial estimate of the combined model parameters is given by $\hat{\mathbf{y}}_i^{(0)} = \mathbf{0}$, $\forall i \in \mathcal{V}$. The power scaling parameters $\gamma_j^{(t,s)}$ (cf. (12)) and $\alpha_i^{(t,s)}$ (cf. (15)) for $s \in \mathcal{S}_i^{\rm Tx}$ need to be properly chosen such that the power consumed by device $i \in \mathcal{V}$ per communication block satisfies
$$\sum_{s\in\mathcal{S}_i^{\rm AT}} \mathbb{E}\big[\|\mathbf{x}_{ij_s}^{(t,s)}\|^2\big] + \sum_{s\in\mathcal{S}_i^{\rm BT}} \mathbb{E}\big[\|\mathbf{x}_i^{(t,s)}\|^2\big] \leq N P^{(t)}, \quad \forall i \in \mathcal{V}, \quad (19)$$
where node $j_s$ is the center node connected to node $i$ in slot $s \in \mathcal{S}_i^{\rm AT}$. Applying a simple equal power policy across the different transmission slots of a device $i \in \mathcal{V}$ for all communication blocks, we have (cf. (19))
$$\mathbb{E}\big[\|\mathbf{x}_{ij}^{(t,s)}\|^2\big] \leq N P^{(t)}/|\mathcal{S}_i^{\rm Tx}|, \quad \forall s \in \mathcal{S}_i^{\rm AT}, \quad (20)$$
$$\mathbb{E}\big[\|\mathbf{x}_i^{(t,s)}\|^2\big] \leq N P^{(t)}/|\mathcal{S}_i^{\rm Tx}|, \quad \forall s \in \mathcal{S}_i^{\rm BT}. \quad (21)$$
In addition, device $j \in \mathcal{V}$ needs to update the estimate of its own model parameter $\hat{\boldsymbol{\theta}}_j^{(t+1)}$ as
$$\hat{\boldsymbol{\theta}}_j^{(t+1)} = \hat{\boldsymbol{\theta}}_j^{(t)} + \frac{m}{d}(\mathbf{A}^{(t)})^T\mathbf{A}^{(t)}\big(\boldsymbol{\theta}_j^{(t+1/2)} - \hat{\boldsymbol{\theta}}_j^{(t)}\big). \quad (22)$$
Finally, device $j \in \mathcal{V}$ approximates update (3) as
$$\boldsymbol{\theta}_j^{(t+1)} = \boldsymbol{\theta}_j^{(t+1/2)} + \zeta^{(t)}\big(w_{jj}\hat{\boldsymbol{\theta}}_j^{(t+1)} + \hat{\mathbf{y}}_j^{(t+1)} - \hat{\boldsymbol{\theta}}_j^{(t+1)}\big). \quad (23)$$
To sum up, the proposed analog implementation is presented in Algorithm 3 in the appendices.

IV. CONVERGENCE ANALYSIS FOR DIGITAL TRANSMISSION
In this section, we derive convergence properties of the general class of digital transmission protocols presented in Section III-A. The analysis holds for any fixed transmission schedule, which determines the number $M$ of slots. We start by recalling that, at each iteration $t$, update (11) is carried out by device $i \in \mathcal{V}$ for all nodes $j \in \{i\} \cup \mathcal{N}_i$. In (11), the concatenation of compression, quantization, and decompression yields an output vector $\frac{m_j^{(t)}}{d}(\mathbf{A}_j^{(t)})^T Q_b(\mathbf{A}_j^{(t)}\mathbf{u}_j^{(t)})$ for the input vector $\mathbf{u}_j^{(t)} = \boldsymbol{\theta}_j^{(t+1/2)} - \hat{\boldsymbol{\theta}}_j^{(t)}$. The number $m_j^{(t)} = \lfloor B_j^{(t)}/b \rfloor$ of rows of matrix $\mathbf{A}_j^{(t)}$ at iteration $t$ depends on the current rate (10) supported by the fast fading channels between device $j$ and its neighbors. Taking the randomness of the fading realizations into account, the counterpart of the compression property (9) under digital transmission is given by the following lemma.

Lemma 4.1:
On average over RLC and fading channels, the mean-square estimation error for the concatenation of compression, quantization, and decompression under digital transmission satisfies
$$\mathbb{E}\Bigg\|\mathbf{u} - \frac{m_i^{(t)}}{d}(\mathbf{A}_i^{(t)})^T Q_b(\mathbf{A}_i^{(t)}\mathbf{u})\Bigg\|^2 \leq \big(1 - \omega^{(t)}\big)\|\mathbf{u}\|^2, \quad (24)$$
for all $\mathbf{u} \in \mathbb{R}^{d\times 1}$ and for all $i \in \mathcal{V}$, with $\omega^{(t)} = \min_{i\in\mathcal{V}} \omega_i^{(t)}$, where we have $\omega_i^{(t)} = \frac{1}{d} + \frac{1}{d}\sum_{n=2}^{d} G_i^{(t)}(n)$ with the function $G_i^{(t)}(n): \mathbb{Z}^+ \to \mathbb{R}$, $i \in \mathcal{V}$, defined as
$$G_i^{(t)}(n) = \exp\Bigg(-\frac{N_0}{P^{(t)} M A}\Big(2^{\frac{nbM}{N}} - 1\Big)\sum_{j\in\mathcal{N}_i}\Big(\frac{d_{ij}}{d_0}\Big)^{\gamma}\Bigg). \quad (25)$$

Proof:
Please refer to Appendix C.

By (24), the parameter $\omega = \min_{t\in\{0,\ldots,T-1\}} \omega^{(t)} \in [0, 1]$ is a measure of the quality of the reconstruction of the model difference used in update (11). Supposing static channel conditions in which the transmission rate (10) remains constant over iterations, we have $m_i^{(t)} = m_i$, $\forall i \in \mathcal{V}$. In these conditions, the right-hand side (RHS) of (24) reduces to $(1 - \frac{m_i}{d})\|\mathbf{u}\|^2$ (see Appendix C), which is exactly the RHS of (9) for $m = m_i$ and $\mathbf{A}^{(t)} = \mathbf{A}_i^{(t)}$. Note that $\omega^{(t)}$ is increasing in $P^{(t)}$. In particular, when the transmission power $P^{(t)} \to \infty$, it is seen from (25) that $G_i^{(t)}(n) \to 1$, thus leading to zero mean-square estimation error ($\omega^{(t)} \to 1$).

With Lemma 4.1, the convergence properties of the digital protocol can be quantified in a manner similar to [21, Theorem 4]. To this end, we define the following topology-related parameters dependent on the mixing matrix $\mathbf{W}$: the spectral gap $\delta = 1 - \|\frac{1}{K}\mathbf{1}\mathbf{1}^T - \mathbf{W}\|$; the parameter $\beta = \|\mathbf{I} - \mathbf{W}\|$; and the function $p(\delta, \omega) = \frac{\delta^2\omega}{16\delta + \delta^2 + 4\beta^2 + 2\delta\beta^2 - 8\delta\omega}$, which depends on the spectral gap $\delta$ and on the model-difference estimation quality $\omega = \min_{t\in\{0,\ldots,T-1\}}\{\omega^{(t)}\}$. Then, the convergence of the digital implementation is provided by the following theorem.

Theorem 4.1 (Optimality Gap for Digital Transmission [21, Theorem 19]):
For a learning rate $\eta^{(t)} = \frac{c_0}{\mu(t+a)}$ and a consensus step size $\zeta^{(t)} = \frac{p(\delta,\omega)}{\delta} \triangleq \zeta(\delta, \omega)$, where $a \geq \max\{\frac{c_1}{p(\delta,\omega)}, \frac{c_2 L}{\mu}\}$ and $c_0, c_1, c_2$ are the positive numerical constants specified in [21, Theorem 19], Algorithm 2 yields, on average over RLC and the fading channels $\{h_{ij}'^{(t)}\}$, an optimality gap satisfying
$$\mathbb{E}[F(\tilde{\boldsymbol\theta}_T)] - F^* \leq \mathbb{E}\Bigg[\frac{1}{S_T}\sum_{t=0}^{T-1} w^{(t)}\big(F(\bar{\boldsymbol\theta}^{(t)}) - F^*\big)\Bigg] \leq \underbrace{\frac{c_3\,\mu a^3}{S_T}\, v_e^{(0)} + \frac{c_4\, T(a+T)}{\mu S_T}\,\frac{\bar\sigma^2}{K}}_{\text{centralized error}} + \underbrace{\frac{c_5\, L T}{\mu^2\,(p(\delta,\omega))^2\, S_T}\, G^2}_{\text{consensus error}}, \quad (26)$$
where $c_3, c_4, c_5$ are again numerical constants from [21, Theorem 19]; $w^{(t)} = (a+t)^2$; $S_T = \sum_{t=0}^{T-1} w^{(t)}$; $\tilde{\boldsymbol\theta}_T = \frac{1}{S_T}\sum_{t=0}^{T-1} w^{(t)}\bar{\boldsymbol\theta}^{(t)}$ is the weighted average of the iterates $\bar{\boldsymbol\theta}^{(t)} = \frac{1}{K}\sum_{i\in\mathcal{V}}\boldsymbol\theta_i^{(t)}$ across the communication rounds; $F^*$ denotes the optimal objective value of problem (P0); $\mu$ is the parameter of the $\mu$-strongly convex function $F(\boldsymbol\theta)$; $v_e^{(0)} = \|\bar{\boldsymbol\theta}^{(0)} - \boldsymbol\theta^*\|^2$ measures the initial distance to the optimal model parameter; and $\bar\sigma^2 = \frac{1}{K}\sum_{i\in\mathcal{V}}\sigma_i^2$ is the average of the mini-batch gradient variances over all devices.

Remark 4.1:
The labels "centralized error" and "consensus error" in (26) refer to a decomposition of the upper bound on the optimality gap into a term that accounts for the performance of the average model $\bar{\boldsymbol\theta}^{(t)} = \frac{1}{K}\sum_{i\in\mathcal{V}}\boldsymbol\theta_i^{(t)}$ (the "centralized error"), and a term that measures the disagreement among agents (the "consensus error"). To gain insight into how wireless resources, channel conditions, and the topology of the connectivity graph affect the performance of the digital wireless implementation, we can rewrite (26) in terms of the relevant parameters in "big-O" notation as [21, Theorem 4]
$$\mathbb{E}[F(\tilde{\boldsymbol\theta}_T)] - F^* \leq \mathcal{O}\Big(\frac{\bar\sigma^2}{\mu K T}\Big) + \mathcal{O}\Big(\frac{\bar\sigma^2}{\mu K T^2} + \frac{L G^2}{\mu^2 (p(\delta,\omega))^2 T^2}\Big) + \mathcal{O}\Big(\frac{G^2}{\mu T^3}\Big). \quad (27)$$
This result shows that, when the total number of iterations $T$ is sufficiently large, the optimality gap (27) behaves as $\mathcal{O}(\frac{\bar\sigma^2}{\mu K T})$, which recovers the convergence rate of centralized SGD with ideal communications. However, when the wireless communication resources are limited, and hence $NT \ll \infty$, the second term, scaling as $\mathcal{O}(\frac{1}{T^2})$, becomes equally important in (27), demonstrating the impact of the topology via $\delta$ and $\beta$, as well as the effect of the quality of digital transmission via $\omega$. Since the function $p(\delta, \omega)$ is monotonically increasing in $\delta \in [0, 1]$ and $\omega \in [0, 1]$, the second term in (27) decreases with $\delta$ and $\omega$. This implies that convergence is improved for more connected graphs with larger $\delta$ [34], and for smaller estimation errors, i.e., larger $\omega$.

A. Numerical Illustration
In this subsection, we corroborate the analysis by numerically evaluating each constituent term of the upper bound (26) on the optimality gap. We consider a setup consisting of $K = 20$ devices located at randomly and independently selected distances of at least 50 m from a center position, with all angles (in radians) uniformly distributed. The connectivity graph, accounting for the impact of slow fading, is modelled as: (i) a complete graph; (ii) a planar grid graph; (iii) a planar grid graph with torus wrapping; or (iv) a star graph as in conventional FL. The strong-convexity parameter $\mu$ and the smoothness factor $L$ are set to fixed values. We plot the upper bound at iteration $t$ normalized by the corresponding value at $t = 0$, hence evaluating the improvement in the expected optimality gap. The SNR is defined as the received SNR, averaged over fast fading, at a reference distance from the deployment center. The path loss parameters $A$ and $\gamma$ and the noise power $N_0$ are fixed, with $d_0 = 1$ m, and the mixing parameter is set as $\alpha = 2/(\lambda_1(\mathbf{L}) + \lambda_{K-1}(\mathbf{L}))$, where $\mathbf{L}$ is the Laplacian of the connectivity graph.

We apply the vertex-coloring-based scheduling proposed in [1] in order to determine the number of slots. Fig. 4 plots separately the centralized and the consensus errors, as well as the overall error in (26), for numbers of iterations $t = 2000$ and $t = 5000$ under a planar grid topology with torus wrapping. The centralized error does not depend on the SNR, and it decreases at the fastest rate over iterations. The consensus error decreases with the received SNR due to the improvement in the parameter $\omega$ that characterizes the quality of model reconstruction (cf. (26)). As a result, the consensus error dominates the optimality gap until the received SNR increases to a sufficiently large value dependent on the iteration $t$.
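The topology-dependent quantities entering the bound can be computed directly from the mixing matrix. The following sketch, assuming a ring graph as a simple stand-in for the sparser topologies, evaluates the spectral gap $\delta = 1 - \|\frac{1}{K}\mathbf{1}\mathbf{1}^T - \mathbf{W}\|$ under the choice $\alpha = 2/(\lambda_1(\mathbf{L}) + \lambda_{K-1}(\mathbf{L}))$ used in the experiments.

```python
import numpy as np

def mixing_matrix(adj, alpha):
    """W = I - alpha * Laplacian: symmetric and doubly stochastic for small alpha."""
    K = adj.shape[0]
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.eye(K) - alpha * lap

def spectral_gap(W):
    """delta = 1 - || (1/K) 1 1^T - W ||_2 (spectral norm)."""
    K = W.shape[0]
    return 1.0 - np.linalg.norm(np.ones((K, K)) / K - W, ord=2)

def ring_adjacency(K):
    """Adjacency of a ring (cycle) graph on K nodes."""
    adj = np.zeros((K, K))
    for i in range(K):
        adj[i, (i + 1) % K] = adj[(i + 1) % K, i] = 1.0
    return adj

K = 20
topologies = {"complete": np.ones((K, K)) - np.eye(K), "ring": ring_adjacency(K)}
gaps = {}
for name, adj in topologies.items():
    lap = np.diag(adj.sum(axis=1)) - adj
    eig = np.sort(np.linalg.eigvalsh(lap))[::-1]   # eigenvalues, decreasing order
    alpha = 2.0 / (eig[0] + eig[K - 2])            # alpha = 2/(lambda_1 + lambda_{K-1})
    gaps[name] = spectral_gap(mixing_matrix(adj, alpha))
```

For the complete graph this construction gives $\delta = 1$, while for sparser graphs such as the ring the gap is much smaller, illustrating how less connected topologies slow down the consensus-error decay.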
Furthermore, the overall optimality gap approaches the centralized error as the SNR increases.

Fig. 4. Normalized upper bounds (26) on the optimality gap versus the received SNR levels for a planar grid graph with torus wrapping: (a) $t = 2000$; (b) $t = 5000$.

We now turn to an analysis of the impact of the topology of the connectivity graph on the convergence of digital transmission. To this end, we set the same received SNR for all devices, ignoring the impact of pathloss, in order to isolate the impact of different topologies. We employ TDMA-based scheduling that assigns only one device as the transmitter in each slot, so that there is an equal number $M = K$ of slots for all topologies. Fig. 5 reports the optimality gap, along with the constituent errors in (26). The optimality gap decreases with the spectral gap $\delta$ of the connectivity graph, which is largest ($\delta = 1$) for the complete graph, followed, in decreasing order, by the planar grid with torus wrapping, the planar grid, and the star topology. Fig. 5 shows that the consensus error contributes very little to the overall error for densely connected graphs, such as the complete graph and the planar grid with torus wrapping, while it becomes dominant in less densely connected graphs, such as the planar grid and the star graph.