Federated Learning over Wireless Device-to-Device Networks: Algorithms and Convergence Analysis
Hong Xing, Osvaldo Simeone, and Suzhi Bi
Abstract
The proliferation of Internet-of-Things (IoT) devices and cloud-computing applications over siloed data centers is motivating renewed interest in the collaborative training of a shared model by multiple individual clients via federated learning (FL). To improve the communication efficiency of FL implementations in wireless systems, recent works have proposed compression and dimension-reduction mechanisms, along with digital and analog transmission schemes that account for channel noise, fading, and interference. This prior art has mainly focused on star topologies consisting of distributed clients and a central server. In contrast, this paper studies FL over wireless device-to-device (D2D) networks by providing theoretical insights into the performance of digital and analog implementations of decentralized stochastic gradient descent (DSGD). First, we introduce generic digital and analog wireless implementations of communication-efficient DSGD algorithms, leveraging random linear coding (RLC) for compression and over-the-air computation (AirComp) for simultaneous analog transmissions. Next, under the assumptions of convexity and connectivity, we provide convergence bounds for both implementations. The results demonstrate the dependence of the optimality gap on the connectivity and on the signal-to-noise ratio (SNR) levels in the network. The analysis is corroborated by experiments on an image-classification task.
Index Terms
Federated learning, distributed learning, decentralized stochastic gradient descent, over-the-air com-putation, D2D networks.
Part of this paper has been presented at the IEEE International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), May 2020 [1]. H. Xing and S. Bi are with the College of Electronic and Information Engineering, Shenzhen University, Shenzhen, 518060, China (e-mails: hong.xing, [email protected]). O. Simeone is with the King's Communications, Learning, and Information Processing (KCLIP) lab, Department of Engineering, King's College London, London, WC2R 2LS, U.K. (e-mail: [email protected]). The work of O. Simeone is supported by the European Research Council under the European Union's Horizon 2020 research and innovation program under grant 725731.
I. INTRODUCTION
With the proliferation of Internet-of-Things (IoT) devices and cloud-computing applications over siloed data centres, distributed learning has become a critical enabler for artificial intelligence (AI) solutions [2], [3]. In distributed learning, multiple agents collaboratively train a machine learning model via the exchange of training data, model parameters, and/or gradient vectors over geographically distributed computing resources and data.
Federated learning (FL) refers to distributed learning protocols that do not directly exchange the training data, in an attempt to reduce the communication load and to limit privacy concerns [4]–[7]. In conventional FL, multiple clients train a shared model by exchanging model-related parameters with a central node. This class of protocols hence relies on a parameter-server architecture, which is typically realized in wireless settings via a base station-centric network topology [8]–[13].

There are important scenarios in which a base station-centric architecture is either unavailable or undesirable due to coverage, privacy, implementation efficiency, or fault-tolerance considerations [14], [15]. In such cases, distributed learning must rely on a peer-to-peer, edge-based communication topology that encompasses device-to-device (D2D) links among individual learning agents over an arbitrary connectivity graph. With the exception of [1], [16], all prior works on decentralized FL nevertheless assume either ideal or rate-limited but noiseless D2D communications. This paper is the first to offer a rigorous convergence analysis of digital and analog implementations of wireless D2D FL, with the aim of providing insights into the effect of wireless impairments caused by link blockages, pathloss, channel fading, and interference.
A. Related Work
The problem of alleviating the communication load in FL systems has been widely investigated, mostly under the assumption of noiseless, rate-limited links and star topologies. Key elements of these solutions are compression and dimension-reduction operations that map the original model parameters or gradient vectors into representations defined by a limited number of bits and/or by sparsity. Important classes of solutions include unbiased compressors (e.g., [8]–[10]) and biased compressors with error-feedback mechanisms (e.g., [11]–[13]).

In a D2D architecture, even in the presence of noiseless communication, devices can only exchange information with their respective neighbors, making consensus mechanisms essential to ensure agreement towards the common learning goal [17], [18]. A well-known protocol integrating stochastic gradient descent (SGD) and consensus is Decentralized Stochastic Gradient Descent (DSGD), which has been further extended and improved via gradient tracking [19], as well as via variance-reduction schemes for large data heterogeneity among agents [20]. As for FL in star topologies, the communication overhead in decentralized learning can be reduced via compression, as demonstrated by the CHOCO-SGD algorithm [21], [22]. This protocol, combining DSGD with biased compression, was studied for strongly convex and smooth objectives in [21] and for non-convex smooth objectives in [22]. It was further combined with event-triggered protocols in [23].

A large number of recent works have proposed communication strategies and multi-access protocols for FL in wireless star topologies [24]–[28]. At the physical layer, over-the-air computation (AirComp) was investigated in [13], [25], [29]–[32] as a promising solution to support simultaneous transmissions by leveraging the waveform superposition property of the wireless medium. Unlike conventional digital communication over orthogonal transmission blocks, AirComp is based on analog, i.e., uncoded, transmission, which enables the estimation of aggregated statistics directly from the received baseband samples. This reduces the communication burden by relieving the network of the need to decode information separately for each participating device. The impact of AirComp on the performance of FL was studied in [31], [32]. The authors in [31] proposed an adaptive learning-rate scheduler and investigated the convergence of the resulting protocol. Reference [32] derived sufficient conditions in terms of signal-to-noise ratio (SNR) for the FL algorithm to attain the same convergence rate as in the ideal noiseless case.

The literature on decentralized FL in wireless D2D architectures is, in contrast, still quite limited. A DSGD-based algorithm termed MATCHA was proposed in [33] to account for interference among nearby links. Adopting a general interference graph, the scheme is based on sampling a matching decomposition of the graph, whereby connectivity-critical links are activated with a higher probability. By relying on a matching decomposition, MATCHA schedules noiseless non-interfering communication links in parallel. No attempt was made to leverage non-orthogonal physical-layer protocols such as AirComp. In contrast, in the conference version [1] of this paper, wireless protocols for both digital and analog implementations of error-compensated DSGD were designed by including AirComp. However, no theoretical analysis was offered on the convergence of the considered wireless decentralized FL algorithms.
B. Main Contributions
In this paper, we provide for the first time a rigorous analysis of digital and analog transmission protocols tailored to FL over wireless D2D networks in terms of their convergence properties. The contributions of this paper are summarized as follows.

1) We introduce generic digital and analog wireless implementations of DSGD algorithms. The protocols rely on compression via random linear coding (RLC) as applied to model differential information. The general protocols enable broadcasting for digital transmission, and both broadcasting and AirComp for analog transmission.

2) Under the assumptions of convexity and connectivity, we derive convergence bounds for the generic digital wireless implementation. The result demonstrates the dependence of the optimality gap on the connectivity of the graph and on the model differential estimation error due to compression.

3) We also provide convergence bounds for the analog wireless implementation. The analysis reveals the impact of topology and channel noise, as well as the importance of implementing an adaptive consensus step size. To the best of our knowledge, this is the first time that an adaptive consensus step size is shown to be beneficial for convergence.

4) We provide numerical experiments for image classification, confirming the benefits of the proposed adaptive consensus rate and demonstrating the agreement between analytical and empirical results.

The remainder of this paper is organized as follows. The system model is presented in Section II. Digital and analog transmission protocols are introduced in Section III. The convergence analyses for the two implementations are presented in Section IV and Section V, respectively. Numerical performance results are described in Section VI, followed by conclusions in Section VII.
C. Notations
We use upper-case boldface letters for matrices and lower-case boldface letters for vectors. We use $\|\cdot\|$ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix. The average of vectors $\mathbf{x}_i$ over $i \in \mathcal{V}$ is defined as $\bar{\mathbf{x}} = \frac{1}{K}\sum_{i\in\mathcal{V}} \mathbf{x}_i$. Notations $\mathrm{Tr}(\cdot)$ and $(\cdot)^T$ denote the trace and the transpose of a matrix, respectively; $\mathbb{E}[\cdot]$ stands for the statistical expectation of a random variable; $\mathbf{I}$ represents an identity matrix of appropriate size; and $\triangleq$ indicates a mathematical definition. $\lambda_i(\cdot)$ denotes the $i$th largest eigenvalue of a matrix.

II. SYSTEM MODEL
In this paper, we consider an FL problem in a decentralized setting, as shown in Fig. 1, in which a set $\mathcal{V} = \{1, \ldots, K\}$ of $K$ devices can only communicate with their respective neighbors over a wireless D2D network whose connectivity is characterized by an undirected graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, with $\mathcal{V}$ denoting the set of nodes and $\mathcal{E} \subseteq \{(i, j) \in \mathcal{V} \times \mathcal{V} \,|\, i \neq j\}$ the set of edges. The set of neighbors of node $i$ is denoted as $\mathcal{N}_i = \{j \in \mathcal{V} \,|\, (i, j) \in \mathcal{E}\}$. Following the FL framework, each device has available a local data set, and all devices collaboratively train a machine learning model by exchanging model-related information without directly disclosing data samples to one another.

Fig. 1. The connectivity graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ for a wireless D2D network.

A. Learning Model
Each device $i \in \mathcal{V}$ has access to its local data set $\mathcal{D}_i$, which may have a non-empty intersection with the data set $\mathcal{D}_j$ of any other device $j \in \mathcal{V}$, $j \neq i$. All devices share a common machine learning model class, which is parametrized by a vector $\boldsymbol{\theta} \in \mathbb{R}^{d\times 1}$. As a typical example, the model class may consist of a neural network with a given architecture. The goal of the network is to solve the empirical risk minimization problem [19], [22]
$$\text{(P0)}: \quad \underset{\boldsymbol{\theta}}{\text{Minimize}} \;\; F(\boldsymbol{\theta}) \triangleq \frac{1}{K}\sum_{i\in\mathcal{V}} f_i(\boldsymbol{\theta}),$$
where $f_i(\boldsymbol{\theta}) = \frac{1}{|\mathcal{D}_i|}\sum_{\xi\in\mathcal{D}_i} l(\boldsymbol{\theta}; \xi)$ is the local empirical risk function for device $i$, with $l(\boldsymbol{\theta}; \xi)$ denoting the loss accruing from parameter $\boldsymbol{\theta}$ on data sample $\xi \in \mathcal{D}_i$, which may include the effect of regularization.

To enable decentralized learning, we adopt CHOCO-SGD, a communication-efficient variant of DSGD [21]. At the start of each iteration $t+1$, device $i \in \mathcal{V}$ has in its memory its current model iterate $\boldsymbol{\theta}_i^{(t)}$, the corresponding estimated version $\hat{\boldsymbol{\theta}}_i^{(t)}$, and the estimated iterates $\hat{\boldsymbol{\theta}}_j^{(t)}$ for all its neighbors $j \in \mathcal{N}_i$. We note that an equivalent version of the algorithm that requires less memory can be found in [21, Algorithm 6], but we do not consider it here since it does not change the communication requirements. At each iteration $t$, device $i \in \mathcal{V}$ first executes a local update step by SGD based on its data set $\mathcal{D}_i$ as
$$\boldsymbol{\theta}_i^{(t+1/2)} = \boldsymbol{\theta}_i^{(t)} - \eta^{(t)} \hat{\nabla} f_i(\boldsymbol{\theta}_i^{(t)}), \quad (1)$$
where $\eta^{(t)}$ denotes the learning rate, and $\hat{\nabla} f_i(\boldsymbol{\theta}_i^{(t)})$ is an estimate of the exact gradient $\nabla f_i(\boldsymbol{\theta}_i^{(t)})$ obtained from a mini-batch $\mathcal{D}_i^{(t)} \subseteq \mathcal{D}_i$ of the data set, i.e., $\hat{\nabla} f_i(\boldsymbol{\theta}_i^{(t)}) = \frac{1}{|\mathcal{D}_i^{(t)}|}\sum_{\xi\in\mathcal{D}_i^{(t)}} \nabla l(\boldsymbol{\theta}_i^{(t)}; \xi)$. Then, each device $i \in \mathcal{V}$ compresses the difference $\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}$ between the locally updated model (1) and the previously estimated iterate $\hat{\boldsymbol{\theta}}_i^{(t)}$. The compressed difference $C^{(t)}(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)})$ is then exchanged with the neighbors of node $i$.
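The local SGD step (1) and the exchange of a compressed model differential can be sketched numerically as follows. This is a minimal illustration, not the exact protocol: the dimensions, the learning rate, and the random $\pm 1$ projection standing in for the RLC compressor of Section II-C are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 16  # model dimension and compressed dimension (illustrative)

def local_sgd_step(theta, grad_est, eta):
    """Local update (1): theta^{(t+1/2)} = theta^{(t)} - eta * gradient estimate."""
    return theta - eta * grad_est

def compress(u, A):
    """Linear compression C(u) = A u of the model differential."""
    return A @ u

def decompress(v, A):
    """Decoding D(v) = (m/d) A^T v, mirroring the RLC decoder of Section II-C."""
    return (m / d) * (A.T @ v)

# Stand-in random +-1 projection (the paper uses a partial-Hadamard RLC matrix).
A = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)

theta = rng.standard_normal(d)        # current iterate theta_i^{(t)}
theta_hat = np.zeros(d)               # previously shared estimate theta_hat_i^{(t)}
grad_est = rng.standard_normal(d)     # mini-batch gradient estimate (illustrative)
theta_half = local_sgd_step(theta, grad_est, eta=0.1)

# Only the compressed differential is exchanged with neighbors; both end points
# apply the same decoding, so the shared estimates stay synchronized.
msg = compress(theta_half - theta_hat, A)
theta_hat_new = theta_hat + decompress(msg, A)
```

Since sender and receivers decode the same message with the same shared matrix, every node holds an identical copy of the estimate, which the subsequent consensus correction then operates on.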
Assuming for now that communication is reliable (an assumption that we will revisit in the rest of the paper), each device $i \in \mathcal{V}$ updates the estimated model parameters $\hat{\boldsymbol{\theta}}_j^{(t)}$ for itself and for its neighbors as
$$\hat{\boldsymbol{\theta}}_j^{(t+1)} = \hat{\boldsymbol{\theta}}_j^{(t)} + D^{(t)}\big(C^{(t)}(\boldsymbol{\theta}_j^{(t+1/2)} - \hat{\boldsymbol{\theta}}_j^{(t)})\big), \quad j \in \{i\} \cup \mathcal{N}_i, \quad (2)$$
where $D^{(t)}(\cdot)$ is a decoding function. Next, device $i \in \mathcal{V}$ executes a consensus update step by correcting the updated model (1) using the estimated parameters (2) as
$$\boldsymbol{\theta}_i^{(t+1)} = \boldsymbol{\theta}_i^{(t+1/2)} + \zeta^{(t)} \sum_{j\in\mathcal{N}_i\cup\{i\}} w_{ij}\big(\hat{\boldsymbol{\theta}}_j^{(t+1)} - \hat{\boldsymbol{\theta}}_i^{(t+1)}\big), \quad (3)$$
where $\zeta^{(t)}$ is the consensus rate, and the mixing matrix $\mathbf{W} = \mathbf{W}^T \in \mathbb{R}^{K\times K}$ is selected to be doubly stochastic, i.e., $[\mathbf{W}]_{ij} = w_{ij} \geq 0$, $\mathbf{W}\mathbf{1} = \mathbf{1}$, $\mathbf{1}^T\mathbf{W} = \mathbf{1}^T$, and $\|\mathbf{W} - \mathbf{1}\mathbf{1}^T/K\| < 1$. A typical choice is to set $w_{ij} = \alpha$ for all $j \in \mathcal{N}_i$, $w_{ii} = 1 - |\mathcal{N}_i|\alpha$, and $w_{ij} = 0$ otherwise, where the constant $\alpha > 0$ is a design parameter. We postpone the discussion of the compression operator $C^{(t)}(\cdot)$ and the decoding operator $D^{(t)}(\cdot)$ to Section II-C. The considered decentralized learning protocol is summarized in Algorithm 1. Finally, we make the following assumptions, which are widely adopted in the literature on decentralized stochastic optimization [21].

Algorithm 1:
Decentralized Learning with Noiseless Communication

Input: consensus step size $\zeta^{(t)}$, SGD learning step size $\eta^{(t)}$, connectivity graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$, and mixing matrix $\mathbf{W}$.
Initialize at each node $i \in \mathcal{V}$: $\boldsymbol{\theta}_i^{(0)}$, $\hat{\boldsymbol{\theta}}_j^{(0)} = \mathbf{0}$, $\forall j \in \mathcal{N}_i \cup \{i\}$;
for $t = 0, 1, \ldots, T-1$ do
  for each device $i \in \mathcal{V}$ do in parallel
    update $\boldsymbol{\theta}_i^{(t+1/2)} = \boldsymbol{\theta}_i^{(t)} - \eta^{(t)}\hat{\nabla} f_i(\boldsymbol{\theta}_i^{(t)})$;
    compress the difference $\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}$ to obtain $C^{(t)}(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)})$;
    for each neighboring device $j \in \mathcal{N}_i$ do in parallel
      send $C^{(t)}(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)})$;
      receive $C^{(t)}(\boldsymbol{\theta}_j^{(t+1/2)} - \hat{\boldsymbol{\theta}}_j^{(t)})$;
    end
    update $\hat{\boldsymbol{\theta}}_j^{(t+1)} = \hat{\boldsymbol{\theta}}_j^{(t)} + D^{(t)}\big(C^{(t)}(\boldsymbol{\theta}_j^{(t+1/2)} - \hat{\boldsymbol{\theta}}_j^{(t)})\big)$, for $j \in \{i\} \cup \mathcal{N}_i$;
    update $\boldsymbol{\theta}_i^{(t+1)} = \boldsymbol{\theta}_i^{(t+1/2)} + \zeta^{(t)}\sum_{j\in\mathcal{N}_i\cup\{i\}} w_{ij}\big(\hat{\boldsymbol{\theta}}_j^{(t+1)} - \hat{\boldsymbol{\theta}}_i^{(t+1)}\big)$.
  end
end
Output: $\boldsymbol{\theta}_i^{(T-1)}$, $\forall i \in \mathcal{V}$

Assumption 2.1:
Each local empirical risk function $f_i(\boldsymbol{\theta})$, $i \in \mathcal{V}$, is $L$-smooth and $\mu$-strongly convex; that is, for all $\boldsymbol{\theta} \in \mathbb{R}^{d\times 1}$ and $\boldsymbol{\theta}' \in \mathbb{R}^{d\times 1}$, it satisfies the inequalities
$$f_i(\boldsymbol{\theta}) \leq f_i(\boldsymbol{\theta}') + \nabla f_i(\boldsymbol{\theta}')^T(\boldsymbol{\theta} - \boldsymbol{\theta}') + \frac{L}{2}\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|^2, \quad (4)$$
and
$$f_i(\boldsymbol{\theta}) \geq f_i(\boldsymbol{\theta}') + \nabla f_i(\boldsymbol{\theta}')^T(\boldsymbol{\theta} - \boldsymbol{\theta}') + \frac{\mu}{2}\|\boldsymbol{\theta} - \boldsymbol{\theta}'\|^2. \quad (5)$$

Assumption 2.2:
The variance of the mini-batch gradient $\hat{\nabla} f_i(\boldsymbol{\theta}_i)$ is bounded as
$$\mathbb{E}_{\mathcal{D}_i^{(t)}}\big[\|\hat{\nabla} f_i(\boldsymbol{\theta}_i) - \nabla f_i(\boldsymbol{\theta}_i)\|^2\big] \leq \sigma_i^2, \quad (6)$$
and its expected squared Euclidean norm is bounded as
$$\mathbb{E}_{\mathcal{D}_i^{(t)}}\big[\|\hat{\nabla} f_i(\boldsymbol{\theta}_i)\|^2\big] \leq G^2, \quad (7)$$
where the expectation $\mathbb{E}_{\mathcal{D}_i^{(t)}}[\cdot]$ is taken over the selection of a mini-batch $\mathcal{D}_i^{(t)} \subseteq \mathcal{D}_i$.

B. Communication Model
As seen in Fig. 2, at the end of every iteration $t$, communication takes place within one communication block of a total number $N$ of channel uses, spanning $M$ equal-length slots, denoted by $\mathcal{S} = \{1, \ldots, M\}$. Slow fading remains constant across all iterations and is binary, determining whether a link is blocked or not. A link $(i, j) \in \mathcal{E}$ is by definition not blocked, while all the other links $(i, j) \notin \mathcal{E}$ are blocked. We assume that the connectivity graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ with all the unblocked links as edges satisfies the following assumption.

Assumption 2.3:
Graph $\mathcal{G}(\mathcal{V}, \mathcal{E})$ is a connected graph.

For all unblocked links $(i, j) \in \mathcal{E}$, the channel coefficient between devices $i$ and $j$ is modelled as
$$h_{ij}'^{(t)} \triangleq \sqrt{A\Big(\frac{d_{ij}}{d_0}\Big)^{-\gamma}}\, h_{ij}^{(t)}, \quad (8)$$
where the fast fading coefficient $h_{ij}^{(t)} \sim \mathcal{CN}(0, 1)$ remains unchanged within one communication block and varies independently across blocks, and the path loss gain $\sqrt{A}(d_0/d_{ij})^{\gamma/2}$ is constant across all iterations. Here, $A$ is the average channel power gain at the reference distance $d_0$; $d_{ij}$ is the distance between devices $i$ and $j$; and $\gamma$ is the path loss exponent.

Fig. 2. Timeline of training iterations and communication blocks: A communication block of $N$ channel uses, divided into $M$ slots, is employed to exchange the compressed difference between model parameters among neighboring devices.

Each device is subject to an energy constraint of $N P^{(t)}$ per communication block. If a device is active for $M' \leq M$ slots, the energy per symbol is hence given by $\frac{N P^{(t)}}{M' N/M} = P^{(t)}\frac{M}{M'}$. The mean-square power of the additive white Gaussian noise (AWGN) is denoted as $N_0$.

C. Compression
In this subsection, we describe the assumed compression operator $C(\cdot)$ and decompression operator $D(\cdot)$ used in update (3). We specifically adopt random linear coding (RLC) compression [10]. Let $\mathbf{A}^{(t)} = \sqrt{1/m}\,\mathbf{H}\mathbf{R}^{(t)}$ be the linear encoding matrix, where $\mathbf{H} \in \{\pm 1\}^{m\times d}$ with $m \leq d$ is a partial Hadamard matrix with mutually orthogonal rows, i.e., $\frac{1}{d}\mathbf{H}\mathbf{H}^T = \mathbf{I}$; and $\mathbf{R}^{(t)} \in \mathbb{R}^{d\times d}$ is a diagonal matrix with its diagonal entries $[\mathbf{R}^{(t)}]_{ii} \triangleq r_i^{(t)}$ drawn from uniform distributions such that $\Pr(r_i^{(t)} = 1) = \Pr(r_i^{(t)} = -1) = 0.5$, for all $i = 1, \ldots, d$. The compression operator is given by the linear projection $C^{(t)}(\mathbf{u}) = \mathbf{A}^{(t)}\mathbf{u}$, while decoding takes place as $D^{(t)}(\mathbf{v}) = \frac{m}{d}(\mathbf{A}^{(t)})^T\mathbf{v}$. The concatenation of the compression and decompression operators, namely $D^{(t)}(C^{(t)}(\mathbf{u})) = \frac{m}{d}(\mathbf{A}^{(t)})^T\mathbf{A}^{(t)}\mathbf{u}$, satisfies the compression property [10], [21]
$$\mathbb{E}\Big\|\mathbf{u} - \frac{m}{d}(\mathbf{A}^{(t)})^T\mathbf{A}^{(t)}\mathbf{u}\Big\|^2 = \Big(1 - \frac{m}{d}\Big)\|\mathbf{u}\|^2, \quad \text{for all } \mathbf{u} \in \mathbb{R}^{d\times 1}. \quad (9)$$
We note that the random matrices $\{\mathbf{R}^{(t)}\}$ need to be shared among devices prior to the start of the communication protocol so that the same random sequence $\{\mathbf{A}^{(t)}\}$ is agreed upon by all devices.

III. DIGITAL AND ANALOG TRANSMISSION PROTOCOLS
In this section, we describe digital and analog wireless implementations of the decentralized learning algorithm reviewed in the previous section. The implementations are meant to serve as prototypical templates for the deployment of decentralized learning. In practice, specific protocols are needed to specify the scheduling strategy used to allocate the slots in Fig. 2 to devices. The analysis in this paper, detailed in Section IV and Section V, applies to any scheduling algorithm.
A. Digital Transmission
With the digital transmission protocol, devices represent their model updates as digital messages for transmission. The number $B_i^{(t)}$ of bits that device $i \in \mathcal{V}$ can successfully broadcast to its neighbors during a slot allocated to it by the scheduling algorithm is limited by the neighboring device with the worst channel power gain. Accordingly, we have
$$B_i^{(t)} = \frac{N}{M}\log_2\Big(1 + \frac{P^{(t)} M}{N_0}\min_{j\in\mathcal{N}_i}|h_{ij}'^{(t)}|^2\Big). \quad (10)$$
We recall that, in (10), the number $M$ of time slots per iteration is determined by the scheduling scheme. To quantize the encoded vector $\mathbf{A}_i^{(t)}(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}) \in \mathbb{R}^{m\times 1}$ into $B_i^{(t)}$ bits, we employ a simple per-element $b$-bit quantizer $Q_b(\cdot)$ with chip-level precision, so that $b = 64$ or $b = 32$ for double-precision or single-precision floating point, respectively, according to the IEEE standard. Communication constraints thus impose the inequality $b\, m_i^{(t)} \leq B_i^{(t)}$, which is satisfied by setting $m_i^{(t)} = \lfloor B_i^{(t)}/b \rfloor$. Based on the received quantized signal, each device $i \in \mathcal{V}$ updates the estimated model parameters of its own as well as of its neighbors $j \in \{i\} \cup \mathcal{N}_i$ as (cf. (2))
$$\hat{\boldsymbol{\theta}}_j^{(t+1)} = \hat{\boldsymbol{\theta}}_j^{(t)} + \frac{m_j^{(t)}}{d}(\mathbf{A}_j^{(t)})^T Q_b\big(\mathbf{A}_j^{(t)}(\boldsymbol{\theta}_j^{(t+1/2)} - \hat{\boldsymbol{\theta}}_j^{(t)})\big). \quad (11)$$
In order to implement update (11), each node $j \in \mathcal{V}$ and its neighbors in the set $\mathcal{N}_j$ can share a priori a common sequence of (pseudo-)random matrices $\tilde{\mathbf{A}}_j^{(t)} \in \mathbb{R}^{m\times d}$ satisfying the assumptions described in Section II-C. If node $j$ sends its current value $m_j^{(t)}$ to all neighbors and if $m_j^{(t)} \leq m$, all nodes can thus select the same submatrix $\mathbf{A}_j^{(t)}$ from $\tilde{\mathbf{A}}_j^{(t)}$ to evaluate (11). The described digital implementation is summarized in Algorithm 2 in the appendices.

B. Analog Transmission
With analog transmission, devices directly transmit their respective updated parameters by mapping analog signals to channel uses, without the need for digitization. As studied in [25], [30], in addition to broadcasting as in digital transmission, it is also useful to schedule all devices that share a common neighbor for simultaneous transmission in order to enable AirComp. The considered class of protocols can accommodate any scheduling scheme that, as in [1], operates over pairs of consecutive time slots in order to leverage AirComp. In the first slot of each pair, one or more center nodes receive a superposition of the signals simultaneously transmitted by all their respective neighbors. In the second slot, the center nodes serve as broadcast transmitters communicating to all their neighbors. The total number $M$ of time slots is thus given by twice the number $n$ of pairs of time slots, which is specified by the scheduling policy in use.

To elaborate on the operation of the analog transmission protocol, we will use the following notation. For each device $i \in \mathcal{V}$, we define a set $\mathcal{S}_i^{\rm Tx} \subseteq \mathcal{S}$ of transmission slots, with $\mathcal{S}_i^{\rm Tx} = \mathcal{S}_i^{\rm BT} \cup \mathcal{S}_i^{\rm AT}$ partitioned into disjoint subsets $\mathcal{S}_i^{\rm BT}$ and $\mathcal{S}_i^{\rm AT}$ ($\mathcal{S}_i^{\rm BT} \cap \mathcal{S}_i^{\rm AT} = \emptyset$). Subset $\mathcal{S}_i^{\rm BT}$ denotes the set of transmission slots in which device $i$ broadcasts to its neighbors, and $\mathcal{S}_i^{\rm AT}$ denotes the set of slots in which device $i$ transmits to enable AirComp. Similarly, we define the set $\mathcal{S}_i^{\rm Rx} \subseteq \mathcal{S}$ of receiving slots for device $i$ as $\mathcal{S}_i^{\rm Rx} = \mathcal{S}_i^{\rm BR} \cup \mathcal{S}_i^{\rm AR}$ with $\mathcal{S}_i^{\rm BR} \cap \mathcal{S}_i^{\rm AR} = \emptyset$, where $\mathcal{S}_i^{\rm BR}$ and $\mathcal{S}_i^{\rm AR}$ denote the sets of receiving slots in which device $i \in \mathcal{V}$ receives from a transmitter in broadcast and AirComp modes, respectively. Fig. 3 serves as an example illustrating the above definitions, listing the resulting non-empty sets $\mathcal{S}_i^{\rm AT}$, $\mathcal{S}_i^{\rm BT}$, $\mathcal{S}_i^{\rm AR}$, and $\mathcal{S}_i^{\rm BR}$ for each node.

Fig. 3. An example illustrating pairs of consecutive time slots for analog transmissions, given a scheduling scheme that yields $M = 6$ for the connectivity graph shown in Fig. 1.

We now describe the transmitted and the received signals in each pair of slots of the communication protocol.
Odd slots: All devices $i \in \mathcal{V}$ operating in AirComp mode for a center node $j$ in an odd slot $s \in \mathcal{S}_i^{\rm AT} \cap \mathcal{S}_j^{\rm AR}$ concurrently transmit the signals
$$\mathbf{x}_{ij}^{(t,s)} = \frac{\sqrt{\gamma_j^{(t,s)}}}{h_{ij}'^{(t)}}\, w_{ji}\, \mathbf{A}^{(t)}\big(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}\big), \quad (12)$$
where $\gamma_j^{(t,s)}$ is a power scaling factor for channel alignment at device $j$. The receiving center node, device $j$, obtains
$$\mathbf{y}_j^{(t,s)} = \sqrt{\gamma_j^{(t,s)}}\sum_{i\in\mathcal{N}_j^{(s)}} w_{ji}\, \mathbf{A}^{(t)}\big(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}\big) + \mathbf{n}_j^{(t,s)}, \quad (13)$$
where $\mathcal{N}_j^{(s)}$ is the set of neighbors of device $j$ operating in AirComp mode at slot $s$, and $\mathbf{n}_j^{(t,s)} \sim \mathcal{CN}(\mathbf{0}, N_0\mathbf{I})$ is the received AWGN at slot $s$ of iteration $t$. Device $j$ estimates the combined model parameters $\sum_{i\in\mathcal{N}_j^{(s)}} w_{ji}(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)})$ via the linear estimator
$$\hat{\mathbf{y}}_j^{(t,s)} = \frac{m}{d}(\mathbf{A}^{(t)})^T \Re\Big\{\mathbf{y}_j^{(t,s)}\big/\sqrt{\gamma_j^{(t,s)}}\Big\}. \quad (14)$$

Even slots: Any device $i \in \mathcal{V}$ operating in broadcast mode in an even slot $s \in \mathcal{S}_i^{\rm BT}$ transmits the signal
$$\mathbf{x}_i^{(t,s)} = \sqrt{\alpha_i^{(t,s)}}\, \mathbf{A}^{(t)}\big(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}\big), \quad (15)$$
where $\alpha_i^{(t,s)}$ is device $i$'s transmit power scaling factor in slot $s \in \mathcal{S}_i^{\rm BT}$ of iteration $t$. Each neighboring device $j \in \mathcal{N}_i^{(s)}$, with $s \in \mathcal{S}_j^{\rm BR}$, receives from device $i$ the signal
$$\mathbf{y}_{ij}^{(t,s)} = \sqrt{\alpha_i^{(t,s)}}\, h_{ij}'^{(t)}\, \mathbf{A}^{(t)}\big(\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}\big) + \mathbf{n}_j^{(t,s)}, \quad (16)$$
where $\mathbf{n}_j^{(t,s)} \sim \mathcal{CN}(\mathbf{0}, N_0\mathbf{I})$ is the received AWGN. Device $j$ estimates the signal $\boldsymbol{\theta}_i^{(t+1/2)} - \hat{\boldsymbol{\theta}}_i^{(t)}$ via the linear estimator
$$\hat{\mathbf{y}}_{ij}^{(t,s)} = w_{ji}\,\frac{m}{d}(\mathbf{A}^{(t)})^T \Re\Bigg\{\frac{\mathbf{y}_{ij}^{(t,s)}}{\sqrt{\alpha_i^{(t,s)}}\, h_{ij}'^{(t)}}\Bigg\}, \quad (17)$$
where $\Re\{\cdot\}$ denotes the real part of its argument. Next, device $j \in \mathcal{V}$ updates its estimate of the combined model parameters of all neighboring devices in $\mathcal{N}_j$ by aggregating the estimates obtained in all receiving slots in the set $\mathcal{S}_j^{\rm Rx} = \mathcal{S}_j^{\rm BR} \cup \mathcal{S}_j^{\rm AR}$ as
$$\hat{\mathbf{y}}_j^{(t+1)} = \hat{\mathbf{y}}_j^{(t)} + \sum_{s\in\mathcal{S}_j^{\rm AR}} \hat{\mathbf{y}}_j^{(t,s)} + \sum_{s\in\mathcal{S}_j^{\rm BR}} \hat{\mathbf{y}}_{i_s j}^{(t,s)}, \quad (18)$$
where node $i_s \in \mathcal{N}_j$ is the node that transmits in broadcast mode in slot $s \in \mathcal{S}_j^{\rm BR}$.
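The AirComp superposition and the linear estimator used in the odd slots can be illustrated with a small simulation. This is a hedged sketch under simplifying assumptions: the channels are pre-inverted at the transmitters so that the received signal is already aligned as in (13), all quantities are real-valued, and the projection matrix, mixing weights, and noise power are illustrative stand-ins rather than the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 64, 32       # model dimension and projected dimension (illustrative)
K_nb = 3            # neighbors transmitting to the center node (assumed)
N0 = 1e-3           # noise power (assumed)
gamma = 10.0        # power-scaling factor gamma_j^{(t,s)} (assumed)

# Shared random +-1 projection standing in for the RLC matrix A^{(t)}.
A = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)

w = np.full(K_nb, 0.2)                   # mixing weights w_{ji} (assumed)
U = rng.standard_normal((K_nb, d))       # model differentials of the neighbors

# With channel inversion at the transmitters, the center node receives the
# aligned superposition of the projected differentials plus noise, as in (13).
y = np.sqrt(gamma) * sum(w[i] * (A @ U[i]) for i in range(K_nb))
y = y + np.sqrt(N0 / 2) * rng.standard_normal(m)   # real part of the AWGN

# Linear estimator (14) of the weighted sum of the differentials.
target = sum(w[i] * U[i] for i in range(K_nb))
estimate = (m / d) * (A.T @ (y / np.sqrt(gamma)))
```

Because the superposition happens over the air, the center node never decodes the individual differentials: a single received vector of $m$ samples yields the aggregated statistic directly, up to the compression distortion of (9) and the channel noise.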
The initial estimate of the combined model parameters is given by $\hat{\mathbf{y}}_i^{(0)} = \mathbf{0}$, $\forall i \in \mathcal{V}$. The power scaling parameters $\gamma_j^{(t,s)}$ (cf. (12)) and $\alpha_i^{(t,s)}$ (cf. (15)) for $s \in \mathcal{S}_i^{\rm Tx}$ need to be properly chosen such that the power consumed by device $i \in \mathcal{V}$ per communication block satisfies
$$\sum_{s\in\mathcal{S}_i^{\rm AT}} \mathbb{E}\big[\|\mathbf{x}_{ij_s}^{(t,s)}\|^2\big] + \sum_{s\in\mathcal{S}_i^{\rm BT}} \mathbb{E}\big[\|\mathbf{x}_i^{(t,s)}\|^2\big] \leq N P^{(t)}, \quad \forall i \in \mathcal{V}, \quad (19)$$
where node $j_s$ is the center node connected to node $i$ in slot $s \in \mathcal{S}_i^{\rm AT}$. Applying a simple equal power policy across the different transmission slots of a device $i \in \mathcal{V}$ for all communication blocks, we have (cf. (19))
$$\mathbb{E}\big[\|\mathbf{x}_{ij}^{(t,s)}\|^2\big] \leq N P^{(t)}/|\mathcal{S}_i^{\rm Tx}|, \quad \forall s \in \mathcal{S}_i^{\rm AT}, \quad (20)$$
$$\mathbb{E}\big[\|\mathbf{x}_i^{(t,s)}\|^2\big] \leq N P^{(t)}/|\mathcal{S}_i^{\rm Tx}|, \quad \forall s \in \mathcal{S}_i^{\rm BT}. \quad (21)$$
In addition, device $j \in \mathcal{V}$ needs to update the estimate of its own model parameter $\hat{\boldsymbol{\theta}}_j^{(t+1)}$ as
$$\hat{\boldsymbol{\theta}}_j^{(t+1)} = \hat{\boldsymbol{\theta}}_j^{(t)} + \frac{m}{d}(\mathbf{A}^{(t)})^T\mathbf{A}^{(t)}\big(\boldsymbol{\theta}_j^{(t+1/2)} - \hat{\boldsymbol{\theta}}_j^{(t)}\big). \quad (22)$$
Finally, device $j \in \mathcal{V}$ approximates update (3) as
$$\boldsymbol{\theta}_j^{(t+1)} = \boldsymbol{\theta}_j^{(t+1/2)} + \zeta^{(t)}\big(w_{jj}\hat{\boldsymbol{\theta}}_j^{(t+1)} + \hat{\mathbf{y}}_j^{(t+1)} - \hat{\boldsymbol{\theta}}_j^{(t+1)}\big). \quad (23)$$
To sum up, the proposed analog implementation is presented in Algorithm 3 in the appendices.

IV. CONVERGENCE ANALYSIS FOR DIGITAL TRANSMISSION
In this section, we derive convergence properties of the general class of digital transmission protocols presented in Section III-A. The analysis holds for any fixed transmission schedule, which determines the number $M$ of slots. We start by recalling that, at each iteration $t$, update (11) is carried out by device $i \in \mathcal{V}$ for all nodes $j \in \{i\} \cup \mathcal{N}_i$. In (11), the concatenation of compression, quantization, and decompression yields an output vector $\frac{m_j^{(t)}}{d}(\mathbf{A}_j^{(t)})^T Q_b(\mathbf{A}_j^{(t)}\mathbf{u}_j^{(t)})$ for the input vector $\mathbf{u}_j^{(t)} = \boldsymbol{\theta}_j^{(t+1/2)} - \hat{\boldsymbol{\theta}}_j^{(t)}$. The number $m_j^{(t)} = \lfloor B_j^{(t)}/b \rfloor$ of rows of matrix $\mathbf{A}_j^{(t)}$ at iteration $t$ depends on the current rate (10) supported by the fast fading channels between device $j$ and its neighbors. Taking the randomness of the fading realizations into account, the counterpart of the compression property (9) under digital transmission is given by the following lemma.

Lemma 4.1:
On average over RLC and fading channels, the mean-square estimation error for the concatenation of compression, quantization, and decompression under digital transmission satisfies
$$\mathbb{E}\Bigg\|\mathbf{u} - \frac{m_i^{(t)}}{d}(\mathbf{A}_i^{(t)})^T Q_b(\mathbf{A}_i^{(t)}\mathbf{u})\Bigg\|^2 \leq \big(1 - \omega^{(t)}\big)\|\mathbf{u}\|^2, \quad (24)$$
for all $\mathbf{u} \in \mathbb{R}^{d\times 1}$ and for all $i \in \mathcal{V}$, with $\omega^{(t)} = \min_{i\in\mathcal{V}} \omega_i^{(t)}$, where we have $\omega_i^{(t)} = \frac{1}{d} + \frac{1}{d}\sum_{n=2}^{d} G_i^{(t)}(n)$ with the function $G_i^{(t)}(n): \mathbb{Z}^+ \to \mathbb{R}$, $i \in \mathcal{V}$, defined as
$$G_i^{(t)}(n) = \exp\Bigg(-\frac{N_0}{P^{(t)} M A}\Big(2^{\frac{nbM}{N}} - 1\Big)\sum_{j\in\mathcal{N}_i}\Big(\frac{d_{ij}}{d_0}\Big)^{\gamma}\Bigg). \quad (25)$$

Proof:
Please refer to Appendix C.

By (24), the parameter $\omega = \min_{t\in\{0,\ldots,T-1\}} \omega^{(t)} \in [0, 1]$ is a measure of the quality of the reconstruction of the model difference used in update (11). Supposing static channel conditions in which the transmission rate (10) remains constant over iterations, we have $m_i^{(t)} = m_i$, $\forall i \in \mathcal{V}$. In these conditions, the right-hand side (RHS) of (24) reduces to $(1 - \frac{m_i}{d})\|\mathbf{u}\|^2$ (see Appendix C), which is exactly the RHS of (9) for $m = m_i$ and $\mathbf{A}^{(t)} = \mathbf{A}_i^{(t)}$. Note that $\omega^{(t)}$ is increasing in $P^{(t)}$. In particular, when the transmission power $P^{(t)} \to \infty$, it is seen from (25) that $G_i^{(t)}(n) \to 1$, thus leading to zero mean-square estimation error ($\omega^{(t)} \to 1$).

With Lemma 4.1, the convergence properties of the digital protocol can be quantified in a manner similar to [21, Theorem 4]. To this end, we define the following topology-related parameters dependent on the mixing matrix $\mathbf{W}$: the spectral gap $\delta = 1 - \|\frac{1}{K}\mathbf{1}\mathbf{1}^T - \mathbf{W}\|$; the parameter $\beta = \|\mathbf{I} - \mathbf{W}\|$; and the function $p(\delta, \omega) = \frac{\delta^2\omega}{16\delta + \delta^2 + 4\beta^2 + 2\delta\beta^2 - 8\delta\omega}$, which depends on the spectral gap $\delta$ and on the model-difference estimation quality $\omega = \min_{t\in\{0,\ldots,T-1\}}\{\omega^{(t)}\}$. Then, the convergence of the digital implementation is provided by the following theorem.

Theorem 4.1 (Optimality Gap for Digital Transmission [21, Theorem 19]):
For a learning rate $\eta^{(t)} = \frac{c_0}{\mu(t+a)}$ and a consensus step size $\zeta^{(t)} = \frac{p(\delta,\omega)}{\delta} \triangleq \zeta(\delta, \omega)$, where $a \geq \max\{\frac{c_1}{p(\delta,\omega)}, \frac{c_2 L}{\mu}\}$ and $c_0, c_1, c_2$ are the positive numerical constants specified in [21, Theorem 19], Algorithm 2 yields, on average over RLC and the fading channels $\{h_{ij}'^{(t)}\}$, an optimality gap satisfying
$$\mathbb{E}[F(\tilde{\boldsymbol\theta}_T)] - F^* \leq \mathbb{E}\Bigg[\frac{1}{S_T}\sum_{t=0}^{T-1} w^{(t)}\big(F(\bar{\boldsymbol\theta}^{(t)}) - F^*\big)\Bigg] \leq \underbrace{\frac{c_3\,\mu a^3}{S_T}\, v_e^{(0)} + \frac{c_4\, T(a+T)}{\mu S_T}\,\frac{\bar\sigma^2}{K}}_{\text{centralized error}} + \underbrace{\frac{c_5\, L T}{\mu^2\,(p(\delta,\omega))^2\, S_T}\, G^2}_{\text{consensus error}}, \quad (26)$$
where $c_3, c_4, c_5$ are again numerical constants from [21, Theorem 19]; $w^{(t)} = (a+t)^2$; $S_T = \sum_{t=0}^{T-1} w^{(t)}$; $\tilde{\boldsymbol\theta}_T = \frac{1}{S_T}\sum_{t=0}^{T-1} w^{(t)}\bar{\boldsymbol\theta}^{(t)}$ is the weighted average of the iterates $\bar{\boldsymbol\theta}^{(t)} = \frac{1}{K}\sum_{i\in\mathcal{V}}\boldsymbol\theta_i^{(t)}$ across the communication rounds; $F^*$ denotes the optimal objective value of problem (P0); $\mu$ is the parameter of the $\mu$-strongly convex function $F(\boldsymbol\theta)$; $v_e^{(0)} = \|\bar{\boldsymbol\theta}^{(0)} - \boldsymbol\theta^*\|^2$ measures the initial distance to the optimal model parameter; and $\bar\sigma^2 = \frac{1}{K}\sum_{i\in\mathcal{V}}\sigma_i^2$ is the average of the mini-batch gradient variances over all devices.

Remark 4.1:
The labels "centralized error" and "consensus error" in (26) refer to a decomposition of the upper bound on the optimality gap into a term that accounts for the performance of the average model $\bar{\boldsymbol\theta}^{(t)} = \frac{1}{K}\sum_{i\in\mathcal{V}}\boldsymbol\theta_i^{(t)}$ (the "centralized error"), and a term that measures the disagreement among agents (the "consensus error"). To gain insight into how wireless resources, channel conditions, and the topology of the connectivity graph affect the performance of the digital wireless implementation, we can rewrite (26) in terms of the relevant parameters in "big-O" notation as [21, Theorem 4]
$$\mathbb{E}[F(\tilde{\boldsymbol\theta}_T)] - F^* \leq \mathcal{O}\Big(\frac{\bar\sigma^2}{\mu K T}\Big) + \mathcal{O}\Big(\frac{\bar\sigma^2}{\mu K T^2} + \frac{L G^2}{\mu^2 (p(\delta,\omega))^2 T^2}\Big) + \mathcal{O}\Big(\frac{G^2}{\mu T^3}\Big). \quad (27)$$
This result shows that, when the total number of iterations $T$ is sufficiently large, the optimality gap (27) behaves as $\mathcal{O}(\frac{\bar\sigma^2}{\mu K T})$, which recovers the convergence rate of centralized SGD with ideal communications. However, when the wireless communication resources are limited, and hence $NT \ll \infty$, the second term, scaling as $\mathcal{O}(\frac{1}{T^2})$, becomes equally important in (27), demonstrating the impact of the topology via $\delta$ and $\beta$, as well as the effect of the quality of digital transmission via $\omega$. Since the function $p(\delta, \omega)$ is monotonically increasing in $\delta \in [0, 1]$ and $\omega \in [0, 1]$, the second term in (27) decreases with $\delta$ and $\omega$. This implies that convergence is improved for more connected graphs with larger $\delta$ [34], and for smaller estimation errors, i.e., larger $\omega$.

A. Numerical Illustration
In this subsection, we corroborate the analysis by numerically evaluating each constituent term of the upper bound (26) on the optimality gap. We consider a setup consisting of $K = 20$ devices located at randomly and independently selected distances of at least 50 m from a center position, with all angles (in radians) uniformly distributed. The connectivity graph, accounting for the impact of slow fading, is modelled as: (i) a complete graph; (ii) a planar grid graph; (iii) a planar grid graph with torus wrapping; or (iv) a star graph as in conventional FL. The strong-convexity parameter $\mu$ and the smoothness factor $L$ are set to fixed values. We plot the upper bound at iteration $t$ normalized by the corresponding value at $t = 0$, hence evaluating the improvement in the expected optimality gap. The SNR is defined as the received SNR, averaged over fast fading, at a reference distance from the deployment center. The path loss parameters $A$ and $\gamma$ and the noise power $N_0$ are fixed, with $d_0 = 1$ m, and the mixing parameter is set as $\alpha = 2/(\lambda_1(\mathbf{L}) + \lambda_{K-1}(\mathbf{L}))$, where $\mathbf{L}$ is the Laplacian of the connectivity graph.

We apply the vertex-coloring-based scheduling proposed in [1] in order to determine the number of slots. Fig. 4 plots separately the centralized and the consensus errors, as well as the overall error in (26), for numbers of iterations $t = 2000$ and $t = 5000$ under a planar grid topology with torus wrapping. The centralized error does not depend on the SNR, and it decreases at the fastest rate over iterations. The consensus error decreases with the received SNR due to the improvement in the parameter $\omega$ that characterizes the quality of model reconstruction (cf. (26)). As a result, the consensus error dominates the optimality gap until the received SNR increases to a sufficiently large value dependent on the iteration $t$.
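The topology-dependent quantities entering the bound can be computed directly from the mixing matrix. The following sketch, assuming a ring graph as a simple stand-in for the sparser topologies, evaluates the spectral gap $\delta = 1 - \|\frac{1}{K}\mathbf{1}\mathbf{1}^T - \mathbf{W}\|$ under the choice $\alpha = 2/(\lambda_1(\mathbf{L}) + \lambda_{K-1}(\mathbf{L}))$ used in the experiments.

```python
import numpy as np

def mixing_matrix(adj, alpha):
    """W = I - alpha * Laplacian: symmetric and doubly stochastic for small alpha."""
    K = adj.shape[0]
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.eye(K) - alpha * lap

def spectral_gap(W):
    """delta = 1 - || (1/K) 1 1^T - W ||_2 (spectral norm)."""
    K = W.shape[0]
    return 1.0 - np.linalg.norm(np.ones((K, K)) / K - W, ord=2)

def ring_adjacency(K):
    """Adjacency of a ring (cycle) graph on K nodes."""
    adj = np.zeros((K, K))
    for i in range(K):
        adj[i, (i + 1) % K] = adj[(i + 1) % K, i] = 1.0
    return adj

K = 20
topologies = {"complete": np.ones((K, K)) - np.eye(K), "ring": ring_adjacency(K)}
gaps = {}
for name, adj in topologies.items():
    lap = np.diag(adj.sum(axis=1)) - adj
    eig = np.sort(np.linalg.eigvalsh(lap))[::-1]   # eigenvalues, decreasing order
    alpha = 2.0 / (eig[0] + eig[K - 2])            # alpha = 2/(lambda_1 + lambda_{K-1})
    gaps[name] = spectral_gap(mixing_matrix(adj, alpha))
```

For the complete graph this construction gives $\delta = 1$, while for sparser graphs such as the ring the gap is much smaller, illustrating how less connected topologies slow down the consensus-error decay.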
Furthermore, the overall optimality gap approaches the centralized error as the SNR increases.

Fig. 4. Normalized upper bounds (26) on the optimality gap versus the received SNR levels for a planar grid graph with torus wrapping: (a) $t = 2000$; (b) $t = 5000$.

We now turn to an analysis of the impact of the topology of the connectivity graph on the convergence of digital transmission. To this end, we set the same received SNR for all devices, ignoring the impact of pathloss, in order to isolate the impact of different topologies. We employ TDMA-based scheduling that assigns only one device as the transmitter in each slot, so that there is an equal number $M = K$ of slots for all topologies. Fig. 5 reports the optimality gap, along with the constituent errors in (26). The optimality gap decreases with the spectral gap $\delta$ of the connectivity graph, which is largest ($\delta = 1$) for the complete graph, followed, in decreasing order, by the planar grid with torus wrapping, the planar grid, and the star topology. Fig. 5 shows that the consensus error contributes very little to the overall error for densely connected graphs, such as the complete graph and the planar grid with torus wrapping, while it becomes dominant in less densely connected graphs, such as the planar grid and the star graph.