Wirelessly Powered Federated Edge Learning: Optimal Tradeoffs Between Convergence and Power Transfer
Qunsong Zeng, Yuqing Du, and Kaibin Huang
Abstract
Federated edge learning (FEEL) is a widely adopted framework for training an artificial intelligence (AI) model distributively at edge devices to leverage their data while preserving their data privacy. The execution of a power-hungry learning task at energy-constrained devices is a key challenge confronting the implementation of FEEL. To tackle the challenge, we propose the solution of powering devices using wireless power transfer (WPT). To derive guidelines on deploying the resultant wirelessly powered FEEL (WP-FEEL) system, this work aims at the derivation of the tradeoff between the model convergence and the settings of power sources in two scenarios: 1) the transmission power and density of power-beacons (dedicated charging stations) if they are deployed, or otherwise 2) the transmission power of a server (access point). The proposed analytical framework relates the accuracy of distributed stochastic gradient estimation to the WPT settings, the randomness in both communication and WPT links, and devices' computation capacities. Furthermore, the local computation at devices (i.e., mini-batch size and processor clock frequency) is optimized to efficiently use the harvested energy for gradient estimation. The resultant learning-WPT tradeoffs reveal simple scaling laws of the model-convergence rate with respect to the transferred energy as well as the devices' computational energy efficiencies. The results provide useful guidelines on WPT provisioning to guarantee learning performance. They are corroborated by experimental results using a real dataset.
I. INTRODUCTION
Recent years have seen a growing trend of deploying machine learning algorithms at the wireless network edge to distill artificial intelligence (AI) from the abundant data at edge devices (e.g., sensors and smartphones), giving rise to an area called edge learning [1], [2]. Among others, federated edge learning (FEEL) is perhaps the most widely adopted framework for its feature of preserving data privacy [3]–[5]. Specifically, instead of uploading data from devices, the framework involves a server distributing a learning task over devices based on distributed implementation of stochastic gradient descent (SGD). One challenge confronting
Q. Zeng, Y. Du, and K. Huang are with The University of Hong Kong, Hong Kong. Contact: K. Huang ([email protected]).

federated learning in practice is that executing a complex task (e.g., training of a large-scale convolutional neural network (CNN)) at edge devices drains their batteries. To tackle this challenge, we propose the solution of deploying wireless power transfer (WPT) to deliver to devices the energy they need for computation and communication. To understand the performance of the resultant wirelessly powered FEEL (WP-FEEL) system, this work aims at quantifying the optimal tradeoffs between model convergence and the settings of power sources, which can be power-beacons (charging stations) or the server, when devices optimally allocate harvested energy for computation and communication to accelerate convergence. The derived tradeoffs, termed the optimal learning-WPT tradeoffs, yield useful insights into system design and deployment.

The current research on implementing FEEL in wireless networks can be separated into two main thrusts focusing on tackling two different challenges. One challenge is the communication bottleneck arising from the wireless uploading of high-dimensional model updates (either local models or local stochastic gradients) from many devices. Attempts to overcome the bottleneck have led to the design of a new class of communication techniques for efficient FEEL including over-the-air update aggregation [5]–[7], resource management [8]–[10], adaptive uploading-frequency control [4], device scheduling [11], and quantization [12]. The other challenge is to execute energy-consuming tasks at edge devices as mentioned earlier. This issue has been addressed in a series of works on designing techniques for jointly managing computation and communication resources [13]–[16] under the criterion of minimizing the devices' total energy consumption during the learning process.
Addressing the same issue, we propose an alternative and direct approach of powering devices using WPT. Though rich convergence analysis exists in the prior work, the tradeoffs between the energy consumption of devices and convergence have not yet been crystallised. The derivation of the desired learning-WPT tradeoff is even more complex due to new issues arising from WPT, especially the following two. First, the unreliabilities of both communication and WPT links jointly affect the number of active devices. Second, each device needs to manage harvested energy for both communication and computation. Their coupling results in the channel dependence of local computation (i.e., mini-batch size and processor frequency) and hence learning performance.

There exists a rich literature on the application of microwave-based WPT to power different types of wireless networks, ranging from communication networks to sensor networks to those supporting mobile-edge computing (see recent surveys in [17], [18]). There exist three main topologies of wirelessly powered networks [19]: 1) the integration of WPT with downlink transmission, called simultaneous wireless information and power transfer (SWIPT) (see e.g., [20]), 2) downlink WPT to power uplink transmission (see e.g., [21]), and 3) separated WPT served by power-beacons and radio access by base stations (see e.g., [22]). As the communication bottleneck of a FEEL system lies in the uplink, the last two topologies are relevant, both of which are considered in this paper. Despite building on the existing network topologies, WP-FEEL systems differ from conventional wirelessly powered communication systems in several aspects.
First, the performance of the former is measured using learning-related metrics (i.e., convergence rate or test accuracy) while that of the latter is measured by communication-related metrics such as throughput (see e.g., [21]), communication energy efficiency (see e.g., [23]), and the rate-harvested-energy tradeoff (see e.g., [24]). Second, the devices in a WP-FEEL system are workers cooperating in training a global model while those in a communication system are subscribers competing for power transfer and the use of radio resources. Third, computing power consumption is either neglected or abstracted as a constant for conventional systems focusing on communication (see e.g., [5], [8], [13]). In contrast, such consumption is at least comparable with its communication counterpart in a WP-FEEL system performing a computation-intensive task. Thus, an elaborate model of the former is adopted in this work so that the analysis can be of practical relevance.

The above distinctions between WP-FEEL systems and their conventional counterparts give rise to new challenges in designing and analyzing the former. To tackle the challenges, the main contribution of this work is the development of a novel analytical framework for quantifying the optimal learning-WPT tradeoff of a WP-FEEL system. As a by-product, a scheme for the optimal control of local computation at devices is designed. The framework is first developed for the scenario where dense power-beacons are deployed to provide reliable WPT without fading, referred to as beacon-WPT [22]. The key components of the framework and relevant findings are described as follows.

1) Distributed Gradient Estimation: Local and global gradient deviations are respectively defined as the expected deviations of a local gradient estimate and the global estimate from the ground truth computed using the local/global datasets. In existing convergence analysis, they are usually studied under the following assumptions: 1) i.i.d.
data distributions at devices, 2) uniform mini-batch sizes, and 3) a fixed number of active devices (see e.g., [25]–[29]). While the first assumption lacks generality, the last two do not hold for the WP-FEEL system featuring random harvested energy, the mentioned channel-dependent heterogeneous computation capacities, and a random number of active devices. To address the issue, we define a generalized system of global and local gradient deviations by relaxing the assumptions. By analyzing these measures, the convergence rate is related to the distribution of the set of active devices as well as the derived probability of a computation-outage event, which occurs when a device fails to harvest sufficient energy to support both communication and computation and hence becomes inactive.
2)
Local-Computation Optimization: Consider an active device and an arbitrary round. After reserving sufficient transmission energy, the remaining harvested energy is used for local computation. Under the energy constraint, the mini-batch size and the processor's computing speed are jointly optimized to minimize the local gradient deviation. They are shown to both increase sub-linearly with the computation energy and to be inversely proportional to the device's computation capacity. In addition, the optimal mini-batch size is also inversely proportional to the workload for local gradient computation.
3)
Optimal Learning-WPT Tradeoff: The tradeoff is derived based on characterizing the effects of WPT on distributed gradient estimation and devices' computation-and-communication capacities. Define the spatial-energy density for beacon-WPT as the total energy transferred from beacons to a randomly located device per round, denoted as $\lambda_{\mathrm{energy}}$. The difference between the convergence rate and its limit in the ideal case of using the global dataset is found to decay as a sub-linear function of the spatial-energy density. The result provides guidelines on power-beacon deployment (i.e., power and density) to provide a guarantee on the learning performance. Moreover, the difference is shown to decay as a weighted sum of sub-linear functions of individual computation energy efficiencies (i.e., the required energy for processing a data sample). Each weight depends on the usefulness of a local dataset and specifically is the local gradient deviation for a single sample. The result suggests the need to consider devices' computation energy efficiencies in WPT provisioning.

The framework is extended to the other scenario of server-WPT, where the server transfers power to devices over fading channels [21]. In particular, the scaling laws described above remain the same except that the spatial-energy density is replaced with the energy beamed by the server to each device in a specific round.

The remainder of the paper is organized as follows. Mathematical models are introduced in Section II. In Section III, distributed gradient estimation is analyzed to relate the convergence rate to the distribution and computation capacities of active devices. The optimal learning-WPT tradeoff for beacon-WPT is derived in Section IV and extended to the scenario of server-WPT in Section V. Experimental results are presented in Section VI, followed by concluding remarks in Section VII.

II. MATHEMATICAL MODELS
We consider a single-cell WP-FEEL system in a circular cell with radius $R$. A server equipped with an array of $L$ antennas coordinates FEEL over $K$ single-antenna edge devices, represented by the index set $\mathcal{K} = \{1, \cdots, K\}$. Devices are assumed to have high mobility and their locations are uniformly distributed in the cell and i.i.d. over rounds. The devices are powered by either beacon-WPT or server-WPT as illustrated in Fig. 1. In each round with a fixed duration $T$, each device first computes a local gradient and then transmits it to the server. Accordingly, each round is divided into two phases: local computation and gradient uploading (see Fig. 1), which last $T^{\mathrm{cmp}}$ and $T^{\mathrm{cmm}}$ seconds, respectively. The operations of devices are synchronized, resulting in the following time constraints for edge devices:
$$0 < t^{\mathrm{cmp}}_k \le T^{\mathrm{cmp}} \quad \text{and} \quad 0 < t^{\mathrm{cmm}}_k \le T^{\mathrm{cmm}}, \quad \forall k \in \mathcal{K}, \qquad (1)$$
where $t^{\mathrm{cmp}}_k$ and $t^{\mathrm{cmm}}_k$ are the computation and transmission time at device $k$ in one round, respectively. Let $E_k$ denote the amount of energy harvested by device $k$; its computation and communication energy consumptions are represented by $E^{\mathrm{cmp}}_k$ and $E^{\mathrm{cmm}}_k$, respectively. The harvested energy is fixed for beacon-WPT and varies over rounds for server-WPT as elaborated in the sequel. They satisfy the following energy constraint:
$$E^{\mathrm{cmp}}_k + E^{\mathrm{cmm}}_k \le E_k, \quad \forall k \in \mathcal{K}. \qquad (2)$$
The detailed system operations and relevant models are described as follows.

A. Two WPT Models

1) Beacon-WPT:
Consider the WP-FEEL system powered by beacon-WPT in Fig. 1(a). Given their low cost and complexity, dense power-beacons are deployed to power devices over short-range WPT links without fading. The beacons are modelled as a homogeneous Poisson point process (PPP), denoted as $\Psi = \{\mathbf{s}\}$ with density $\lambda_{\mathrm{pb}}$, where $\mathbf{s} \in \mathbb{R}^2$ represents the coordinates of a single beacon. (To be precise, the idling circuit energy consumption, denoted as a constant $\zeta$, exists even when there is no computation and transmission. In this case, the energy constraint is $E^{\mathrm{cmp}}_k + E^{\mathrm{cmm}}_k + \zeta \le E_k$. We omit the constant $\zeta$ as it is negligible compared with $E^{\mathrm{cmp}}_k$ and $E^{\mathrm{cmm}}_k$.) Each device is equipped with an energy harvester comprising a rectifying antenna and a battery [19]. Moreover, WPT is over a dedicated frequency outside the communication band. These allow the device to continuously harvest energy throughout the learning process [see Fig. 1(a)]. Let the coordinates of device $k$ in the $i$-th round be denoted by $\mathbf{r}^{(i)}_k$ and thus the communication range $r^{(i)}_k = |\mathbf{r}^{(i)}_k|$. Adopting a short-range propagation model [30], the instantaneous power received at device $k$ in round $i$ is given as
$$P^{(i)}_k = \rho \bar{P} \sum_{\mathbf{s} \in \Psi} \left( \max\left\{ |\mathbf{r}^{(i)}_k - \mathbf{s}|, \nu \right\} \right)^{-\beta}, \quad \forall k \in \mathcal{K}, \qquad (3)$$
where $\nu \ge 1$ is a given constant avoiding singularity, $\beta > 2$ is the path-loss exponent, $\bar{P}$ is the transmission power of power-beacons, and $\rho$ represents the product of energy-conversion efficiency and energy-beamforming gain. As the power-beacons are dense and homogeneously distributed in the cell, the amount of harvested energy at each device in one round can be approximated as [22]
$$E^{(i)}_k \approx \bar{E} = \frac{\pi \beta \rho \bar{P} \lambda_{\mathrm{pb}} T}{(\beta - 2)\, \nu^{\beta - 2}}, \quad \forall k \in \mathcal{K}. \qquad (4)$$

Definition 1. (Spatial-Energy Density)
The spatial-energy density is defined as $\lambda_{\mathrm{energy}} \triangleq \bar{P} \lambda_{\mathrm{pb}} T$, which is proportional to the beacon density and transmission power. It can be interpreted as the amount of energy delivered by the power-beacon network to an arbitrarily located device in a single round.
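As a quick numerical sanity check of (4), the harvested energy $\bar{E}$ can be evaluated directly; the parameter values in the sketch below are illustrative assumptions, not taken from the paper's experiments.

```python
import math

def harvested_energy_per_round(P_bar, lam_pb, T, beta, nu, rho):
    """Eq. (4): E_bar = pi*beta*rho*P_bar*lam_pb*T / ((beta - 2) * nu**(beta - 2))."""
    assert beta > 2, "the approximation requires a path-loss exponent beta > 2"
    return math.pi * beta * rho * P_bar * lam_pb * T / ((beta - 2) * nu ** (beta - 2))

# Illustrative (assumed) parameters: 1-W beacons, 0.1 beacons/m^2, 1-s rounds.
E1 = harvested_energy_per_round(P_bar=1.0, lam_pb=0.1, T=1.0, beta=3.0, nu=1.0, rho=0.5)
# Doubling the spatial-energy density lambda_energy = P_bar * lam_pb * T
# (here via the beacon density) doubles the harvested energy.
E2 = harvested_energy_per_round(P_bar=1.0, lam_pb=0.2, T=1.0, beta=3.0, nu=1.0, rho=0.5)
```

Since $\bar{E}$ is linear in $\bar{P}\lambda_{\mathrm{pb}}T$, any one of the three factors can be traded against the others without changing the energy delivered per round.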
2) Server-WPT:
In the absence of power-beacons, devices can also be powered by the server over long-range WPT links with fading as illustrated in Fig. 1(b). The server is assumed to be half-duplex and thus can perform WPT only during the local-computation phase in each round when its array is not used for communication [see Fig. 1(b)]. Let the isotropic complex Gaussian vector $\tilde{\mathbf{h}}^{(i)}_k \in \mathbb{C}^{L \times 1}$ represent the Rayleigh fading channel of the WPT link from the server to device $k$. Moreover, let $\mathbf{u}^{(i)}_k \in \mathbb{C}^{L \times 1}$ with $(\mathbf{u}^{(i)}_k)^H \mathbf{u}^{(i)}_k = 1$ denote the energy-beamforming vector, and $\tilde{P}^{(i)}_k$ the transfer power allocated to device $k$. With energy beamforming, the amount of energy harvested by device $k$ in each round is given as
$$E^{(i)}_k = \rho \left( r^{(i)}_k \right)^{-\alpha} \| \tilde{\mathbf{h}}^{(i)}_k \|^2 \tilde{P}^{(i)}_k T^{\mathrm{cmp}}, \quad \forall k \in \mathcal{K}. \qquad (5)$$
Last, the WPT channels are assumed to be i.i.d. over rounds and furthermore independent of the uplink channels since they are in different frequency bands.

Figure 1. WP-FEEL systems and operations: (a) beacon-WPT and (b) server-WPT.

B. Federated Learning Model
A standard federated learning framework is considered as follows (see e.g., [31]). A global model, represented by the parametric vector $\mathbf{w} \in \mathbb{R}^q$ with $q$ denoting the model size, is trained collaboratively across the edge devices by leveraging local labelled datasets. For device $k$, let $\mathcal{D}_k = \{(\mathbf{x}_j, y_j)\}$ denote the local dataset where $\mathbf{x}_j$ and $y_j$ represent the raw data and label of the $j$-th sample. The local loss function is defined as
$$F_k(\mathbf{w}) = \frac{1}{|\mathcal{D}_k|} \sum_{(\mathbf{x}_j, y_j) \in \mathcal{D}_k} \ell(\mathbf{w}; (\mathbf{x}_j, y_j)), \qquad (6)$$
where $\ell(\mathbf{w}; (\mathbf{x}_j, y_j))$ is the sample-wise loss function quantifying the prediction error of the model $\mathbf{w}$ on the training sample $\mathbf{x}_j$ with reference to its true label $y_j$. For convenience, we denote $\ell(\mathbf{w}; (\mathbf{x}_j, y_j))$ as $\ell_j(\mathbf{w})$ and assume uniform sizes for local datasets: $|\mathcal{D}_k| = D$, $\forall k \in \mathcal{K}$. Then the global loss function on all the distributed datasets can be written as
$$F(\mathbf{w}) = \frac{\sum_{k=1}^{K} \sum_{j \in \mathcal{D}_k} \ell_j(\mathbf{w})}{\sum_{k=1}^{K} |\mathcal{D}_k|} = \frac{1}{K} \sum_{k=1}^{K} F_k(\mathbf{w}). \qquad (7)$$
Its gradient $\nabla F(\mathbf{w}^{(i)})$ is referred to as the ground-truth gradient. The learning process is to minimize the global loss function $F(\mathbf{w})$. To this end, each round aims at estimating $\nabla F(\mathbf{w}^{(i)})$ distributively to facilitate SGD.

We adopt the existing gradient-averaging implementation of FEEL with the key operations illustrated in Fig. 2 and described as follows (see e.g., [9]). In each round, say the $i$-th round, the server broadcasts the current model $\mathbf{w}^{(i)}$ to all edge devices. Due to channel fading, only a subset of devices, denoted as $\mathcal{M}^{(i)} \subseteq \mathcal{K}$ with size $M^{(i)} = |\mathcal{M}^{(i)}|$, can participate in learning in this specific round. Each device in $\mathcal{M}^{(i)}$ computes a local estimate of the gradient of its local loss function by randomly sampling its local dataset $\mathcal{D}_k$. We denote the sampled mini-batch local dataset as $\mathcal{B}^{(i)}_k$ whose size is denoted by $b^{(i)}_k = |\mathcal{B}^{(i)}_k|$.
The local gradient at device $k$ in the $i$-th round is estimated using the mini-batch as
$$\mathbf{g}^{(i)}_k = \frac{1}{b^{(i)}_k} \sum_{(\mathbf{x}_j, y_j) \in \mathcal{B}^{(i)}_k} \nabla \ell_j(\mathbf{w}^{(i)}). \qquad (8)$$
Upon completion, the local gradient estimates are sent by active devices to the server for aggregation. Upon receiving them, the global gradient is calculated as
$$\mathbf{g}^{(i)} = \begin{cases} \dfrac{1}{M^{(i)}} \displaystyle\sum_{k \in \mathcal{M}^{(i)}} \mathbf{g}^{(i)}_k, & M^{(i)} > 0, \\ \mathbf{0}, & M^{(i)} = 0. \end{cases} \qquad (9)$$
Subsequently, the global model is updated using SGD as
$$\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \eta\, \mathbf{g}^{(i)}, \qquad (10)$$
where $\eta$ is the given learning rate. The process iterates until the model converges. In the process, the accuracy of distributed gradient estimation can be measured using the following metrics.

Definition 2. (Local and Global Gradient Deviations).
In the $i$-th round, the local gradient deviation at device $k$, denoted as $G^{(i)}_{\mathrm{lo},k}$, refers to the mean-square error between the local gradient estimate and its ground truth:
$$G^{(i)}_{\mathrm{lo},k} = \mathbb{E} \left[ \left\| \mathbf{g}^{(i)}_k - \nabla F_k(\mathbf{w}^{(i)}) \right\|^2 \right]. \qquad (11)$$
The global gradient deviation refers to the expected deviation between the aggregated local gradient estimates at the server and the ground truth:
$$G^{(i)}_{\mathrm{gl}} = \mathbb{E}_{\mathcal{M}^{(i)}} \left\{ \mathbb{E} \left[ \left\| \mathbf{g}^{(i)} - \nabla F(\mathbf{w}^{(i)}) \right\|^2 \right] \right\}, \qquad (12)$$
where the outer expectation is taken over rounds and the distributions of $M^{(i)}$ and $b^{(i)}_k$.
Figure 2. FEEL Operations.
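The round structure of (8)–(10) can be sketched end-to-end on a toy problem. The quadratic sample-wise loss, the synthetic non-i.i.d. local datasets, and the random-participation rule below are all assumptions made for illustration, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, q = 5, 100, 3                 # devices, samples per device, model size
eta = 0.1                           # learning rate
# Non-i.i.d. local datasets: device k's samples cluster around the value k.
data = [rng.normal(k, 1.0, (D, q)) for k in range(K)]
w = np.zeros(q)                     # global model

def local_gradient(w, X, b):
    """Eq. (8) with the toy loss l_j(w) = 0.5*||w - x_j||^2, so grad l_j = w - x_j."""
    batch = X[rng.integers(0, len(X), b)]
    return np.mean(w - batch, axis=0)

losses = []
for i in range(100):
    # Random participation stands in for fading/outage; M^(i) varies per round.
    active = [k for k in range(K) if rng.random() > 0.2]
    grads = [local_gradient(w, data[k], b=16) for k in active]
    g = np.mean(grads, axis=0) if active else np.zeros(q)   # eq. (9)
    w = w - eta * g                                         # eq. (10)
    losses.append(np.mean([0.5 * np.sum((w - x) ** 2) for X in data for x in X]))
```

With all devices equally likely to participate, $\mathbf{w}$ drifts toward the minimizer of the global loss (the mean of all samples), and the recorded global loss decreases over rounds.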
C. Local-Computation Model
The computation-energy consumption depends on two variables: 1) the mini-batch size and 2) the processor's clock frequency. Adopting a standard model in computer engineering [32], we define the per-sample workload $W$ for local-gradient estimation as the number of floating-point operations (FLOPs) needed for processing each data sample. This gives the workload at device $k$ in the $i$-th round as $W^{(i)}_{k,\mathrm{total}} = b^{(i)}_k \times W$. Let $f^{(i)}_{\mathrm{clk},k}$ [in cycle/s] represent the clock frequency of the processor (e.g., CPU or GPU) at device $k$ in round $i$. As a result, the computing speed of the processor, measured in FLOPs per second, can be defined as $f^{(i)}_k = f^{(i)}_{\mathrm{clk},k} \times N^{\mathrm{FLOP}}_k$ with $N^{\mathrm{FLOP}}_k$ denoting the number of FLOPs it can process per cycle. Given the workload and computing speed, the local computation time at device $k$, denoted as $t^{\mathrm{cmp}}_k(i)$, is given by
$$t^{\mathrm{cmp}}_k(i) = \frac{b^{(i)}_k W}{f^{(i)}_k}, \quad \forall k \in \mathcal{K}. \qquad (13)$$
For a CMOS circuit, the power consumption of a processor can be modelled as a function of the clock frequency: $P = \Psi f_{\mathrm{clk}}^3$, where $\Psi$ [in $\mathrm{Watt}/(\mathrm{cycle/s})^3$] is a constant depending on the chip architecture [33]. Based on this model, the power consumption of the processor at device $k$ can be written as
$$P^{\mathrm{cmp}}_k(i) = \Psi_k \left( f^{(i)}_{\mathrm{clk},k} \right)^3 = C_k \left( f^{(i)}_k \right)^3, \quad \forall k \in \mathcal{K}, \qquad (14)$$
where the coefficient $C_k = \Psi_k / \left( N^{\mathrm{FLOP}}_k \right)^3$ characterizes the computation property of the processor. In particular, a smaller value indicates that the processor can compute more workload given the energy consumption per unit time, or consume less energy given the workload per unit time. Given (13) and (14), the resultant energy consumption at device $k$ is given as
$$E^{\mathrm{cmp}}_k(i) = P^{\mathrm{cmp}}_k(i) \times t^{\mathrm{cmp}}_k(i) = b^{(i)}_k C_k W \left( f^{(i)}_k \right)^2, \quad \forall k \in \mathcal{K}. \qquad (15)$$

D. Transmission Model
Without loss of generality, consider uploading by device $k$ in the $i$-th round. Each gradient coefficient is compressed into $Q$ bits such that the effect of quantization on the learning performance is negligible. Then the overhead of transmitting a $q$-dimensional vector is $q \times Q$ bits. The uplink bandwidth is equally divided into $K$ narrow sub-bands of bandwidth $B$ and allocated to the devices for orthogonal transmission. Let the complex Gaussian vector $\mathbf{h}^{(i)}_k$ comprising i.i.d. $\mathcal{CN}(0, 1)$ coefficients represent the Rayleigh fading channel of the considered device. Channels of different devices are assumed independent of each other. Given receive beamforming at the server, the transmission rate for device $k$ in round $i$ can be written as
$$S^{(i)}_k = B \log_2 \left( 1 + \frac{\| \mathbf{h}^{(i)}_k \|^2 P^{\mathrm{cmm}}_k(i)}{\left( r^{(i)}_k \right)^{\alpha} B N_0} \right), \quad \forall k \in \mathcal{K}, \qquad (16)$$
where $P^{\mathrm{cmm}}_k(i)$ represents the transmission power, $N_0$ the power spectral density of the additive white Gaussian noise, $r^{(i)}_k$ the propagation distance, and $\alpha$ the path-loss exponent. The transmission rate is required to support the uploading of $q \times Q$ bits in a single round. This places the following constraint on the transmission power:
$$P^{\mathrm{cmm}}_k(i) = \frac{\left( r^{(i)}_k \right)^{\alpha} B N_0}{\| \mathbf{h}^{(i)}_k \|^2} \left( 2^{\frac{qQ}{B t^{\mathrm{cmm}}_k(i)}} - 1 \right), \quad \forall k \in \mathcal{K}. \qquad (17)$$
The resultant transmission-energy consumption is
$$E^{\mathrm{cmm}}_k(i) = P^{\mathrm{cmm}}_k(i) \times t^{\mathrm{cmm}}_k(i) = \frac{\left( r^{(i)}_k \right)^{\alpha}}{\| \mathbf{h}^{(i)}_k \|^2}\, \varphi\!\left( t^{\mathrm{cmm}}_k(i) \right), \quad \forall k \in \mathcal{K}, \qquad (18)$$
where the function $\varphi(t) \triangleq B N_0 t \left( 2^{\frac{qQ}{B t}} - 1 \right)$.

III. CONVERGENCE AND DISTRIBUTED GRADIENT ESTIMATION
Consider the WP-FEEL system with beacon-WPT. In this section, we aim at analyzing the relation between convergence and several key system variables influencing the accuracy of distributed gradient estimation, including the mini-batch sizes of devices, the number of active devices, and the computation-outage probability. Such results are useful for deriving the learning-WPT tradeoff in the next section. As direct analysis is difficult, a tractable approach is adopted using the global and local gradient deviations as intermediate variables.
To quantify the relation, we follow the literature in making several standard assumptions on the loss function and local estimated gradients as follows (see e.g., [4], [26]–[28], [34]–[37]).
Assumption 1. (Smoothness).
The loss function $F: \mathbb{R}^q \rightarrow \mathbb{R}$ is $\mu$-smooth. Specifically, for all $(\mathbf{u}, \mathbf{v}) \in \mathbb{R}^q \times \mathbb{R}^q$,
$$F(\mathbf{u}) \le F(\mathbf{v}) + \langle \nabla F(\mathbf{v}), \mathbf{u} - \mathbf{v} \rangle + \frac{\mu}{2} \| \mathbf{u} - \mathbf{v} \|^2, \qquad (19)$$
where $\nabla$ is the differential operator and $\langle \cdot, \cdot \rangle$ represents the inner product.

While i.i.d. data distributions over devices are commonly assumed in the literature for simplicity (see e.g., [25]), we consider the general and more practical case of non-i.i.d. data distributions as in [35]. For the case of i.i.d. distributions, local gradients are unbiased with respect to the global full-batch gradient: $\mathbb{E}[\mathbf{g}_k] = \nabla F(\mathbf{w})$, $\forall k \in \mathcal{K}$, where the expectation is taken over the data distribution at device $k$. On the other hand, for the current case of non-i.i.d. distributions, we make the following assumption on local gradient estimates [35].

Assumption 2. (Local Gradient Estimation).
The stochastic gradient estimates $\{\mathbf{g}_k\}$ defined in (8) are unbiased estimates of the local gradients $\{\nabla F_k(\mathbf{w})\}$ defined in (6) and computed using the full local datasets, and they are independent of each other:
$$\mathbb{E}[\mathbf{g}_k] = \nabla F_k(\mathbf{w}), \quad \forall k \in \mathcal{K}, \qquad (20)$$
where the expectation is taken over the local data distribution $\mathcal{D}_k$.

It should be reiterated that local gradients are not equal to the global gradient and their relation is specified in (7). Our analysis does not require a convexity assumption on the loss function and only requires it to be lower bounded as formally stated below, which is the minimal assumption needed for ensuring convergence to a stationary point [36].

Assumption 3. (Bounded Loss Function).
For any parameter vector $\mathbf{w}$, the loss function $F(\mathbf{w})$ is lower bounded by a given scalar $F^*$.

The last assumption given below is also standard in the literature (see e.g., [26]–[28]).

Assumption 4. (Bounded Gradient Norm).
The expected squared norm of the stochastic gradients is uniformly bounded by a constant $\Phi$, that is, $\mathbb{E} \left[ \| \mathbf{g}^{(i)}_k \|^2 \right] \le \Phi$, $\forall k \in \mathcal{K}$ and $\forall i$.

We adopt a widely used metric for measuring the convergence rate of FEEL with a non-convex loss function, namely the expected average gradient norm (over rounds) [29], [34], [37]. Based on the above assumptions, we prove that the expected average gradient norm can be bounded by the average global gradient deviation as shown in the following proposition, thereby relating convergence to gradient estimation.
Proposition 1.
Given a learning rate satisfying $0 < \eta \le \frac{1}{\mu}$, the expected convergence rate of the FEEL algorithm can be upper-bounded by the average global gradient deviation as follows:
$$\mathbb{E} \left[ \frac{1}{N} \sum_{i=0}^{N-1} \left\| \nabla F(\mathbf{w}^{(i)}) \right\|^2 \right] \le \frac{2 \left[ F(\mathbf{w}^{(0)}) - F^* \right]}{\eta N} + \frac{1}{N} \sum_{i=0}^{N-1} G^{(i)}_{\mathrm{gl}}, \qquad (21)$$
where $G^{(i)}_{\mathrm{gl}}$ is defined in (12). Proof:
See Appendix A. □
B. Computation-Outage Probability
The computation-outage probability that affects the global gradient deviation is derived asfollows. Without loss of generality, consider device k . Definition 3. (Computation-Outage Event).
The event occurs at device $k$ in the $i$-th round when its harvested energy $\bar{E}$ is no larger than the required transmission energy $E^{\mathrm{cmm}}_k(r^{(i)}_k, \mathbf{h}^{(i)}_k)$ given the propagation distance $r^{(i)}_k$ and fading channel $\mathbf{h}^{(i)}_k$. As a result, there is zero energy for computation and device $k$ is inactive in round $i$: $k \notin \mathcal{M}^{(i)}$.

It is well known that the gain of the Rayleigh fading channel, $\| \mathbf{h}^{(i)}_k \|^2$, follows the $\chi^2$-distribution with the following probability density function (PDF):
$$f_{\| \mathbf{h}^{(i)}_k \|^2}(h) = \frac{h^{L-1} e^{-h}}{\Gamma(L)}, \quad h \ge 0, \qquad (22)$$
where $\Gamma(\cdot)$ is the Gamma function. On the other hand, since each device is uniformly distributed in the cell, the transmission distance of device $k$ has the following PDF:
$$f_{r^{(i)}_k}(r) = \frac{2r}{R^2}, \quad 0 \le r \le R. \qquad (23)$$
Next, since the transmission energy is a monotone decreasing function of the transmission duration, minimizing energy consumption requires the use of the maximum transmission duration: $t^{\mathrm{cmm}}_k(i) = T^{\mathrm{cmm}}$. Then the required transmission energy in round $i$ is $E^{\mathrm{cmm}}_k(r^{(i)}_k, \mathbf{h}^{(i)}_k) = \frac{(r^{(i)}_k)^{\alpha}}{\| \mathbf{h}^{(i)}_k \|^2} \varphi(T^{\mathrm{cmm}})$. Using the above results, the computation-outage probability is derived as follows.

Lemma 1. (Computation-Outage Probability).
The probability is identical for all devices and all rounds and is given as
$$P_{\mathrm{out}} = \frac{\gamma(L, \xi) - \xi^{-\frac{2}{\alpha}}\, \gamma\!\left( L + \tfrac{2}{\alpha}, \xi \right)}{\Gamma(L)}, \qquad (24)$$
where $\gamma(\cdot, \cdot)$ is the lower incomplete Gamma function, and the parameter $\xi$ is defined as
$$\xi \triangleq \frac{\varphi(T^{\mathrm{cmm}})\, R^{\alpha}}{\bar{E}} = \frac{(\beta - 2)\, \nu^{\beta - 2} R^{\alpha}\, \varphi(T^{\mathrm{cmm}})}{\pi \beta \rho \bar{P} \lambda_{\mathrm{pb}} T}. \qquad (25)$$
Proof:
See Appendix C. □
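Lemma 1 lends itself to a quick Monte Carlo check: under the channel-gain PDF (22) and the distance PDF (23), a device is in outage exactly when $\|\mathbf{h}_k\|^2 < \xi (r_k/R)^{\alpha}$. The sketch below compares (24), evaluated by simple quadrature, against simulation; the antenna number and path-loss exponent used in the usage example are assumed illustrative values.

```python
import math
import numpy as np

def lower_inc_gamma(a, x, n=20000):
    """Lower incomplete gamma gamma(a, x) via trapezoidal quadrature on (0, x]."""
    t = np.linspace(1e-12, x, n)
    y = t ** (a - 1.0) * np.exp(-t)
    return float(np.sum((y[:-1] + y[1:]) * (t[1] - t[0]) / 2.0))

def p_out_closed(L, alpha, xi):
    """Eq. (24): [gamma(L, xi) - xi^(-2/alpha) * gamma(L + 2/alpha, xi)] / Gamma(L)."""
    return (lower_inc_gamma(L, xi)
            - xi ** (-2.0 / alpha) * lower_inc_gamma(L + 2.0 / alpha, xi)) / math.gamma(L)

def p_out_mc(L, alpha, xi, n=300_000, seed=1):
    """Outage iff ||h||^2 < xi * (r/R)^alpha, with ||h||^2 ~ Gamma(L, 1) by (22)
    and (r/R)^2 ~ Uniform(0, 1) by (23)."""
    rng = np.random.default_rng(seed)
    h2 = rng.gamma(L, 1.0, n)
    u = rng.uniform(0.0, 1.0, n)
    return float(np.mean(h2 < xi * u ** (alpha / 2.0)))
```

For example, with $L = 4$ antennas and $\alpha = 3$, the closed form and the simulation agree to Monte Carlo accuracy, and $P_{\mathrm{out}}$ increases monotonically in $\xi$, consistent with Corollary 1 below.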
The parameter $\xi$ defined in (25) is a key parameter influencing $P_{\mathrm{out}}$. The asymptotic scalings of $P_{\mathrm{out}}$ with respect to $\xi$ are characterized in the following corollary of Lemma 1.

Corollary 1.
The computation-outage probability $P_{\mathrm{out}}(\xi)$ is a monotone increasing function of the parameter $\xi$. Asymptotically, $P_{\mathrm{out}}(\xi)$ scales with respect to $\xi$ as follows:
$$\lim_{\xi \rightarrow 0} \frac{P_{\mathrm{out}}(\xi)}{\xi^L} = \frac{2}{(\alpha L + 2)\, \Gamma(L + 1)} \quad \text{and} \quad \lim_{\xi \rightarrow \infty} \frac{1 - P_{\mathrm{out}}(\xi)}{\xi^{-\frac{2}{\alpha}}} = \frac{\Gamma\!\left( L + \frac{2}{\alpha} \right)}{\Gamma(L)}. \qquad (26)$$
Based on the definition of $\xi$ in (25), the above scaling laws suggest that the computation-outage probability can be reduced by 1) enhancing the harvested energy $\bar{E}$ via increasing the density and transmission power of power-beacons or 2) decreasing the cell size.

Remark 1. (Active and Idle Rounds).
A round is idle with learning paused when all devices are in outage (i.e., $M = 0$). The probability of an idle round is $\Pr(M = 0) = P_{\mathrm{out}}^K$ and that of an active round is $\Pr(M > 0) = 1 - P_{\mathrm{out}}^K$.

C. Effects of System Variables on Convergence
Given the result in Proposition 1, characterizing the effects requires only the analysis of the relation between the global gradient deviation and the system parameters. To this end, consider an arbitrary round; the superscripts $i$ of variables, which specify the round index, are omitted in the remainder of this sub-section to simplify notation. An arbitrary active device, say device $k$, randomly draws a mini-batch of $b_k$ samples with the index set $\mathcal{B}_k = \{j_1, \cdots, j_{b_k}\}$. Then the local gradient estimate can be written as $\mathbf{g}_k = \frac{1}{b_k} \sum_{j \in \mathcal{B}_k} \nabla \ell_j(\mathbf{w})$. Its distribution is specified in the following lemma.

Lemma 2. (Distribution of Local Gradient Estimate [38]). At active device $k$, the first two moments of the local gradient estimate are given as $\mathbb{E}[\mathbf{g}_k] = \nabla F_k(\mathbf{w})$ and $\mathrm{Var}[\mathbf{g}_k] = \frac{\Omega_k}{b_k}$ with $\Omega_k$ being a constant defined as $\Omega_k \triangleq \frac{1}{D} \sum_{j \in \mathcal{D}_k} \nabla \ell_j(\mathbf{w}) \nabla \ell_j(\mathbf{w})^T - \nabla F_k(\mathbf{w}) \nabla F_k(\mathbf{w})^T$.

It follows that the local gradient deviation at device $k$ can be written as $G_{\mathrm{lo},k} = \frac{\sigma_k^2}{b_k}$ with its single-sample variance $\sigma_k^2 \triangleq \mathrm{tr}(\Omega_k)$. Based on Assumptions 2 and 3 and Lemma 2, we obtain the following useful result.

Lemma 3.
The global gradient deviation in an arbitrary round can be bounded as
$$G_{\mathrm{gl}} \le \frac{2}{K^2}\, \mathbb{E} \left[ \sum_{k \in \mathcal{M}} \frac{\sigma_k^2}{b_k} \right] + 2 \left\{ \left( 1 - P_{\mathrm{out}}^K \right) \left( \mathbb{E} \left[ \frac{1}{M} \,\Big|\, M > 0 \right] - \frac{1}{K} \right) + P_{\mathrm{out}}^K \right\} \Phi, \qquad (27)$$
where $P_{\mathrm{out}}$ is given in (24). Proof:
See Appendix B. □
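Lemma 2's relation $\mathrm{Var}[\mathbf{g}_k] = \Omega_k / b_k$, and hence $G_{\mathrm{lo},k} = \sigma_k^2 / b_k$, can be verified empirically for mini-batches drawn uniformly with replacement; the per-sample gradients in the sketch below are synthetic stand-ins for $\nabla \ell_j(\mathbf{w})$ at a fixed model.

```python
import numpy as np

rng = np.random.default_rng(2)
D, d, b = 200, 4, 10                   # local dataset size, gradient dim, batch size
grads = rng.normal(0.0, 1.0, (D, d))   # stand-ins for per-sample gradients
full = grads.mean(axis=0)              # local full-batch gradient, grad F_k(w)

# sigma_k^2 = tr(Omega_k): trace of the population covariance of one sample.
sigma2 = float(np.sum(grads.var(axis=0)))

# Empirical G_lo,k = E||g_k - grad F_k(w)||^2 over many mini-batches (eq. (8)).
trials, dev = 20000, 0.0
for _ in range(trials):
    g_k = grads[rng.integers(0, D, b)].mean(axis=0)
    dev += float(np.sum((g_k - full) ** 2))
dev /= trials   # Lemma 2 predicts dev close to sigma2 / b
```

The $1/b_k$ decay is exact for with-replacement sampling; sampling without replacement would add a finite-population correction on top of this.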
Combining Proposition 1 and Lemma 3 gives the main result of this sub-section.
Proposition 2.
Given a learning rate satisfying $0 < \eta \le \frac{1}{\mu}$, the expected convergence rate of the FEEL algorithm satisfies
$$\mathbb{E} \left[ \frac{1}{N} \sum_{i=0}^{N-1} \left\| \nabla F(\mathbf{w}^{(i)}) \right\|^2 \right] \le \frac{2 \left[ F(\mathbf{w}^{(0)}) - F^* \right]}{\eta N} + \frac{2}{K^2}\, \mathbb{E} \left[ \sum_{k \in \mathcal{M}} \frac{\sigma_k^2}{b_k} \right] + 2 \left\{ \left( 1 - P_{\mathrm{out}}^K \right) \left( \mathbb{E} \left[ \frac{1}{M} \,\Big|\, M > 0 \right] - \frac{1}{K} \right) + P_{\mathrm{out}}^K \right\} \Phi. \qquad (28)$$
The above result relates convergence to several key system parameters, including the mini-batch sizes, the number of active devices and its distribution, and the computation-outage probability. In addition, one can observe that the upper bound in Proposition 2 is identical for all rounds, which simplifies the subsequent analysis.

IV. OPTIMAL LEARNING-WPT TRADEOFF FOR BEACON-WPT

In this section, we first give the optimized local-computation policy, which determines how many samples the active devices can process in a round. Then, the global gradient deviation is derived by exploring the two factors that affect it. The learning-energy tradeoff for beacon-WPT is characterized by the relation between the convergence rate and the spatial-energy density from the power-beacon network.
A. Local Computation Optimization
The local-computation variables at each device can be optimized to maximize the accuracy of local gradient estimation. On the one hand, increasing the local batch size for model training reduces the local gradient deviation; on the other hand, it increases the local energy consumption according to (15). Moreover, under the time constraint, processing more samples requires boosting the computing speed, which further increases the energy consumption. It is thus useful to control the two variables, the sampled batch size and the computing speed, under the criterion of minimum local gradient deviation. Consider device $k$ without loss of generality. Since the local gradient deviation $G_{\mathrm{lo},k}$ given in (11) is inversely proportional to the sampled batch size $b_k$, the optimization problem can be formulated as
$$(\mathbf{P1})\quad \max_{\{b_k,\,f_k\}}\; b_k \quad \text{s.t.}\quad 0 < b_k C_k W f_k^2 \le E_k^{\mathrm{cmp}}, \qquad 0 < \frac{b_k W}{f_k} \le T^{\mathrm{cmp}}.$$
For tractability, the batch size $b_k$ is relaxed to be continuous. This is reasonable as the batch size for model training is usually large (e.g., thousands of images). With the relaxation, the optimal policy is derived in closed form as shown below.

Lemma 4. (Optimal Local-Computation Policy).
For optimal local gradient estimation at each active device, the optimal sampled batch size and computing speed should be set as
$$b_k^\star = \frac{1}{W}\Big(\frac{E_k^{\mathrm{cmp}}(T^{\mathrm{cmp}})^2}{C_k}\Big)^{1/3} \quad\text{and}\quad f_k^\star = \Big(\frac{E_k^{\mathrm{cmp}}}{C_k T^{\mathrm{cmp}}}\Big)^{1/3}, \quad \forall k\in\mathcal{M}. \quad (29)$$
The proof involves a straightforward application of the Karush–Kuhn–Tucker (KKT) conditions and is omitted for brevity.

Given the optimal policy, the expectation of the resultant local gradient deviation at an active device is derived as follows. First, the expectation can be expressed in terms of the computation-energy budget $\{E_k^{\mathrm{cmp}}\}$ as
$$\mathbb{E}\big[G^\star_{\mathrm{lo},k}\,\big|\,k\in\mathcal{M}\big] = \frac{W}{(T^{\mathrm{cmp}})^{2/3}}\,\mathbb{E}_{E_k^{\mathrm{cmp}}}\Bigg[\frac{\sigma_k^2 C_k^{1/3}}{(E_k^{\mathrm{cmp}})^{1/3}}\,\Bigg|\, k\in\mathcal{M}\Bigg]. \quad (30)$$
Since the transferred energy equals the sum of the computation and transmission energy, we can write $\{E_k^{\mathrm{cmp}}\}$ as a function of the propagation distance and channel state as $E_k^{\mathrm{cmp}}(r_k, h_k) = \bar E - \frac{r_k^\alpha}{\|h_k\|^2}\varphi(T^{\mathrm{cmm}})$. It follows from (30) that
$$\Pr(k\in\mathcal{M})\,\mathbb{E}\big[G^\star_{\mathrm{lo},k}\,|\,k\in\mathcal{M}\big] = \frac{W}{(T^{\mathrm{cmp}})^{2/3}}\int\!\!\int_\Theta \frac{\sigma_k^2 C_k^{1/3}}{\big(\bar E - E_k^{\mathrm{cmm}}(r,h)\big)^{1/3}}\,\frac{h^{L-1}e^{-h}}{\Gamma(L)}\,\frac{2r}{R^2}\,dr\,dh, \quad (31)$$
where the integration domain is defined as $\Theta = \{(r,h)\in\mathbb{R}^2 : \bar E > E_k^{\mathrm{cmm}}(r,h);\; h\ge 0;\; 0\le r\le R\}$. Based on (31), we obtain the following result.

Lemma 5. (Optimal Local Gradient Deviation).
Given the optimal local computation in Lemma 4, the expectation of the local gradient deviation at each active device can be bounded as
$$\mathbb{E}\big[G^\star_{\mathrm{lo},k}\,|\,k\in\mathcal{M}\big] \le \frac{2W\sigma_k^2 C_k^{1/3}\,B\big(\tfrac{2}{3},\tfrac{2}{\alpha}\big)}{\alpha\,(T^{\mathrm{cmp}})^{2/3}\,\bar E^{1/3}}, \quad (32)$$
where $B(\cdot,\cdot)$ is the Beta function, and $\bar E$ is the harvested energy at one device given in (4).

Proof:
See Appendix D. □
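The closed-form policy in Lemma 4 can be sanity-checked numerically, assuming the computation-energy model $b_kC_kWf_k^2$ and the deadline $b_kW/f_k \le T^{\mathrm{cmp}}$ of (P1); a minimal sketch with hypothetical device parameters:

```python
# Hypothetical parameters (illustrative units), assuming the energy model
# E = b*C*W*f^2 and the per-round deadline b*W/f <= T_cmp from (P1):
E_cmp = 0.2      # computation-energy budget E_k^cmp [J]
T_cmp = 0.05     # computation deadline T^cmp [s]
C = 1e-28        # processor energy coefficient C_k
W = 1e6          # per-sample workload W [FLOPs]

# Closed-form policy of Lemma 4 (cube-root solution of the KKT conditions):
b_star = (E_cmp * T_cmp ** 2 / C) ** (1.0 / 3) / W
f_star = (E_cmp / (C * T_cmp)) ** (1.0 / 3)

# Both constraints should be met with equality at the optimum:
energy_used = b_star * C * W * f_star ** 2   # should equal E_cmp
time_used = b_star * W / f_star              # should equal T_cmp

# No feasible (b, f) on a sweep of clock speeds processes more samples:
best = 0.0
for i in range(1, 201):
    f = f_star * i / 100.0
    b = min(E_cmp / (C * W * f ** 2), T_cmp * f / W)  # largest feasible batch at speed f
    best = max(best, b)

print(b_star, f_star, energy_used, time_used, best)
```

At the optimum both constraints bind, which is the intuition behind the cube-root form: the deadline pins $b_kW = f_kT^{\mathrm{cmp}}$, and the energy budget then fixes $f_k$.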
B. Optimal Learning-WPT Tradeoff
So far we have analyzed two factors, the local gradient deviation and the computation-outage probability, which affect the global gradient deviation. To derive the desired learning-energy tradeoff with optimal local computation, we need to quantify the last factor, $\mathbb{E}\big[\frac{1}{M}\,\big|\,M>0\big]$, with $M$ being the number of active devices (see Proposition 2). This factor reflects the fact that more active devices improve the accuracy of global gradient estimation. To analyze it, whether each device participates in a round can be represented by a Bernoulli random variable with parameter $(1-P_{\mathrm{out}})$, namely $\mathbb{I}_{k\in\mathcal{M}}\sim\mathrm{Ber}(1-P_{\mathrm{out}})$, $\forall k\in\mathcal{K}$, where $\mathbb{I}_{k\in\mathcal{M}}$ denotes the indicator whose value is $1$ if $k\in\mathcal{M}$ and $0$ otherwise. It follows that $M\sim\mathrm{Binom}(K, 1-P_{\mathrm{out}})$, i.e., $M$ follows the Binomial distribution. The probability mass function truncated to $M>0$ is given as
$$\Pr(M=m) = \frac{1}{1-P_{\mathrm{out}}^K}\binom{K}{m}\big(1-P_{\mathrm{out}}\big)^m P_{\mathrm{out}}^{K-m}, \quad m=1,\cdots,K. \quad (33)$$
Using this distribution and a result from [39], the desired factor is obtained in the following lemma.

Lemma 6. (Expected Reciprocal [39]). The expected reciprocal of the number of active devices, $M$, can be written in terms of the computation-outage probability as
$$\mathbb{E}\Big[\frac{1}{M}\,\Big|\,M>0\Big] = \frac{1}{1-P_{\mathrm{out}}^K}\sum_{m=1}^K\frac{P_{\mathrm{out}}^{m-1}-P_{\mathrm{out}}^K}{K-m+1}. \quad (34)$$
The global gradient deviation $G_{\mathrm{gl}}$ can be obtained from Lemma 3 by substituting the results in Lemmas 5 and 6, where we emphasize that $\Pr(k\in\mathcal{M}) = 1-P_{\mathrm{out}}$ and
$$\mathbb{E}\Big[\sum_{k\in\mathcal{M}}G^\star_{\mathrm{lo},k}\Big] = \sum_{k=1}^K\mathbb{E}\big[\mathbb{I}_{k\in\mathcal{M}}\,G^\star_{\mathrm{lo},k}\big] = \sum_{k=1}^K\Pr(k\in\mathcal{M})\,\mathbb{E}\big[G^\star_{\mathrm{lo},k}\,|\,k\in\mathcal{M}\big]. \quad (35)$$
Then substituting $G_{\mathrm{gl}}$ into Proposition 1 completes the derivation. Using the preceding results, the optimal learning-energy tradeoff for the case of beacon-WPT can be readily derived; it is presented in the following theorem.
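As a quick cross-check of Lemma 6, the closed form (34) can be compared with a direct evaluation of the truncated binomial distribution in (33); a minimal sketch with illustrative values of $K$ and $P_{\mathrm{out}}$:

```python
from math import comb

K, P_out = 8, 0.35   # illustrative values; each device is active w.p. 1 - P_out

# Closed form of Lemma 6 / (34):
lemma6 = sum((P_out ** (m - 1) - P_out ** K) / (K - m + 1)
             for m in range(1, K + 1)) / (1.0 - P_out ** K)

# Direct evaluation of E[1/M | M > 0] for M ~ Binom(K, 1 - P_out), using (33):
pmf = [comb(K, m) * (1 - P_out) ** m * P_out ** (K - m) for m in range(K + 1)]
direct = sum(pmf[m] / m for m in range(1, K + 1)) / (1.0 - P_out ** K)

print(lemma6, direct)  # the two values agree
```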
Theorem 1. (Convergence with Beacon-WPT).
Consider the case of beacon-WPT. Given the optimal local computation in Section IV-A, the convergence rate of WP-FEEL is bounded by
$$\mathbb{E}\Big[\frac{1}{N}\sum_{i=0}^{N-1}\|\nabla F(w^{(i)})\|^2\Big] \le \frac{2\big[F(w^{(0)})-F^*\big]}{\eta N} + \frac{\delta\,W C_\sigma\,(1-P_{\mathrm{out}})}{K\,(T^{\mathrm{cmp}})^{2/3}\,(\rho\,\lambda_{\mathrm{energy}})^{1/3}} + R_{\mathrm{es}}, \quad (36)$$
where $\lambda_{\mathrm{energy}}$ is the spatial-energy density, the parameter $C_\sigma\triangleq\frac{1}{K}\sum_{k=1}^K\sigma_k^2 C_k^{1/3}$, the constant $\delta$ collects the factor $\frac{4}{\alpha}B\big(\tfrac{2}{3},\tfrac{2}{\alpha}\big)$ together with a term determined by the WPT path-loss parameters $(\beta,\nu)$ [obtained by substituting the harvested energy in (4) into (32)], the computation-outage probability $P_{\mathrm{out}}$ is specified in Lemma 1, and the residual term $R_{\mathrm{es}}$ is given as
$$R_{\mathrm{es}} = 2\Bigg(\sum_{m=2}^K\frac{P_{\mathrm{out}}^{m-1}-P_{\mathrm{out}}^K}{K-m+1} + P_{\mathrm{out}}\Bigg)\Phi^2. \quad (37)$$

The three terms on the right-hand side of (36) are explained as follows. The first term represents gradient descent using the ground-truth gradients. The second term, which arises from the global gradient deviation, reflects the effects of beacon-WPT and other system parameters on the convergence rate. The effects of individual parameters are discussed as follows.

• (WPT effect). Recall that the spatial-energy density, $\lambda_{\mathrm{energy}} = \bar P\lambda_{\mathrm{pb}}T$, refers to the amount of energy transferred by the power-beacon network to a unit area in a single round. Increasing the density leads to a linear growth of the energy harvested by each device. As a result, active devices can estimate their local gradients with higher accuracy by using larger mini-batch sizes. This decreases the global gradient deviation proportionally to $\lambda_{\mathrm{energy}}^{-1/3}$.

• (Local-computation properties). The computation parameters are grouped in $\frac{WC_\sigma}{(T^{\mathrm{cmp}})^{2/3}}$. It can be interpreted as the reduction of the global gradient deviation attained by processing a single sample at each device. The interpretation can be derived mathematically by writing
$$\frac{WC_\sigma}{(T^{\mathrm{cmp}})^{2/3}} = \frac{1}{K}\sum_{k=1}^K\sigma_k^2\bigg[\frac{C_k W^3}{(T^{\mathrm{cmp}})^2}\bigg]^{1/3} \triangleq \frac{1}{K}\sum_{k=1}^K\sigma_k^2\,\big(e_k^{\mathrm{cmp}}\big)^{1/3}, \quad (38)$$
where $e_k^{\mathrm{cmp}}$ represents the energy required for per-sample computation at device $k$.
While the above quantity quantifies the per-sample gain of distributed gradient estimation, the number of samples each device is capable of processing depends on the spatial-energy density of beacon-WPT, as discussed earlier.

• (Number of devices). Due to the averaging operation in (9), increasing the number of devices, $K$, reduces the global gradient deviation following the scaling law of $O\big(\frac{1}{K}\big)$.

• (Probability of participation). The quantity $(1-P_{\mathrm{out}})$ represents the probability that each device participates in learning. As adding an active device contributes its local gradient deviation to the global counterpart, increasing $(1-P_{\mathrm{out}})$ seems to enlarge the latter. On the contrary, and aligned with intuition, the overall effect is to reduce the global gradient deviation once the last term, $R_{\mathrm{es}}$, is taken into account, since it is also part of the deviation and decreases as $(1-P_{\mathrm{out}})$ grows.

The last term, $R_{\mathrm{es}}$, captures the loss of convergence rate caused by devices in computation outage. If $P_{\mathrm{out}}$ is small, $R_{\mathrm{es}}$ scales as $O(P_{\mathrm{out}})$, confirming the effect mentioned above that $R_{\mathrm{es}}$ decreases as the probability of participation, $(1-P_{\mathrm{out}})$, grows. The mentioned effect of WPT on the convergence of WP-FEEL is mathematically quantified in the following corollary of Theorem 1.

Corollary 2. (Effect of WPT).
Consider the case of beacon-WPT. As the spatial-energy density $\lambda_{\mathrm{energy}}$ grows, the convergence rate improves as follows:
$$\mathbb{E}\Big[\frac{1}{N}\sum_{i=0}^{N-1}\|\nabla F(w^{(i)})\|^2\Big] \le \frac{2\big[F(w^{(0)})-F^*\big]}{\eta N} + \Upsilon\,\lambda_{\mathrm{energy}}^{-1/3} + O\big(\lambda_{\mathrm{energy}}^{-L}\big), \quad \lambda_{\mathrm{energy}}\to\infty, \quad (39)$$
where the parameter $\Upsilon\triangleq\frac{\delta W C_\sigma}{K\rho^{1/3}(T^{\mathrm{cmp}})^{2/3}}$, and the outage probability scales as $P_{\mathrm{out}} = O\big(\lambda_{\mathrm{energy}}^{-L}\big)$.

V. EXTENSION TO SERVER-WPT

The preceding sections focus on beacon-WPT. The results therein can be extended to the case of server-WPT by accounting for fading in the WPT links. Given channel-state information, the power allocation at the server for transfer to different devices can be optimized, as shown in the sequel.
A. Optimal Learning-WPT Tradeoff
In the scenario of server-WPT with fading in the WPT links, the convergence analysis is more involved than its beacon-WPT counterpart. Specifically, the current analysis differs from the beacon-WPT counterpart in two factors: 1) the computation-outage probability, which must account for fading in the WPT link, and 2) the harvested energy, which is now random. For tractability, assume equal power allocation for WPT, namely that each energy beam has a fixed power of $P$ (see Section V-B for power control). First, the computation-outage probability, denoted as $P'_{\mathrm{out}}$, is derived as follows. The definition of a computation-outage event is modified from that in Definition 3 by replacing $\bar E$ with $E_k\big(r_k^{(i)}, \|\tilde h_k^{(i)}\|^2\big)$. Specifically, a computation-outage event occurs if
$$\frac{\|\tilde h_k^{(i)}\|^2\,\|h_k^{(i)}\|^2}{\big(r_k^{(i)}\big)^{2\alpha}} \le \frac{\varphi(T^{\mathrm{cmm}})}{\rho P\,T^{\mathrm{cmp}}}. \quad (40)$$
To simplify notation, define the product random variable $X = \|\tilde h_k^{(i)}\|^2\,\|h_k^{(i)}\|^2$. Since both factors of the product have the density $f(h) = \frac{h^{L-1}e^{-h}}{\Gamma(L)}$, $X$ has the following distribution:
$$f_X(x) = \int_0^\infty f_{\|\tilde h_k\|^2}(h)\,f_{\|h_k\|^2}(x/h)\,\frac{1}{h}\,dh = \frac{2x^{L-1}K_0(2\sqrt{x})}{\Gamma(L)^2}, \quad (41)$$
where $K_0(\cdot)$ is the modified Bessel function of the second kind. Combining (40) and (41) gives
$$P'_{\mathrm{out}} = \int_0^\tau \frac{2x^{L-1}K_0(2\sqrt{x})\big(1-(x/\tau)^{1/\alpha}\big)}{\Gamma(L)^2}\,dx, \quad (42)$$
where the parameter $\tau$ is defined as $\tau\triangleq\frac{R^{2\alpha}\varphi(T^{\mathrm{cmm}})}{\rho P\,T^{\mathrm{cmp}}}$. As the exact expression of $P'_{\mathrm{out}}$ has no closed form, we derive an upper bound for tractability.

Corollary 3.
For a large transferred power ($P\gg1$), the computation-outage probability in the case of server-WPT can be bounded as
$$P'_{\mathrm{out}} \le \frac{\tau^L}{\Gamma(L)^2\,L\,(1+\alpha L)}\ln\frac{1}{\tau} + O\big(P^{-L}\big). \quad (43)$$

Proof:
See Appendix E. □
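The bound in Corollary 3 can be checked numerically by evaluating the outage integral (42) with a simple quadrature for $K_0(\cdot)$; a minimal sketch with illustrative values of $L$, $\alpha$, and $\tau$ (the bound used below is the pre-asymptotic expression from the proof in Appendix E, whose leading term gives (43)):

```python
import math

def K0(z, n=800, t_max=20.0):
    # K0(z) = integral_0^inf exp(-z*cosh t) dt, by the trapezoidal rule
    h = t_max / n
    s = 0.5 * (math.exp(-z) + math.exp(-z * math.cosh(t_max)))
    for i in range(1, n):
        s += math.exp(-z * math.cosh(i * h))
    return s * h

L, alpha, tau = 2, 3.0, 0.01   # illustrative values (tau ~ 1/P, so tau << 1 for large P)

# Outage probability (42), by the midpoint rule:
n = 400
p_out = 0.0
for i in range(1, n + 1):
    x = (i - 0.5) / n * tau
    p_out += x ** (L - 1) * K0(2.0 * math.sqrt(x)) * (1.0 - (x / tau) ** (1.0 / alpha))
p_out *= 2.0 / math.gamma(L) ** 2 * (tau / n)

# Pre-asymptotic upper bound from the proof in Appendix E (cf. (74)):
bound = tau ** L * (1 + 2 * alpha * L - L * (1 + alpha * L) * math.log(tau)) \
        / (math.gamma(L) ** 2 * L ** 2 * (1 + alpha * L) ** 2)

print(p_out, bound)  # p_out <= bound
```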
Given the above result, the proof of Theorem 1 can be straightforwardly extended to the current case by modifying the expressions of the computation-outage probability and the harvested energy, yielding the following main result of this section.
Theorem 2. (Convergence with Server-WPT).
Consider the case of server-WPT. If the transmission power for WPT to each device is large ($P\gg1$), the convergence rate of WP-FEEL is bounded as
$$\mathbb{E}\Big[\frac{1}{N}\sum_{i=0}^{N-1}\|\nabla F(w^{(i)})\|^2\Big] \le \frac{2\big[F(w^{(0)})-F^*\big]}{\eta N} + \frac{\tilde\delta\,W C_\sigma}{K\,T^{\mathrm{cmp}}\,(\rho P)^{1/3}} + O\big(P^{-L}\ln P\big), \quad P\to\infty, \quad (44)$$
where the constant $\tilde\delta$ is defined analogously to $\delta$ and collects $B\big(\tfrac{2}{3},\tfrac{2}{3}+\tfrac{2}{\alpha}\big)$, ratios of Gamma functions in $L$, and a power of the cell radius $R$ [it is obtained by averaging (32) over the now-random harvested energy], and the upper bound on the computation-outage probability $P'_{\mathrm{out}}$ in (43) scales as $O(P^{-L}\ln P)$.

Comparing Theorems 1 and 2, the convergence-rate bound for server-WPT has a similar form to that for beacon-WPT except for two differences. First, the spatial-energy density in the latter is replaced with the per-round transmission energy for WPT, $P\,T^{\mathrm{cmp}}$. Second, the scaling law of the last (residual) term in (44) differs from that in (36) due to WPT-link fading. The effects of the other parameters are identical to those discussed in Section IV-B.

B. Optimizing Server-WPT
The fixed power allocation for WPT in the preceding sub-section can be relaxed to improve the convergence performance. While more sophisticated designs are possible (e.g., involving optimization of the number of active devices), we consider the following practical two-step scheme applied in each round (with the index $i$ omitted in the sequel to simplify notation).

• Step 1 (Scheduling): Let $\tilde P_k$ denote the power allocated for WPT to device $k$. Assuming equal power allocation ($\tilde P_k = P$), select the set of active devices, $\mathcal{M}'$, by applying the computation-outage criterion in (40).

• Step 2 (Optimal Power Control): Given $\mathcal{M}'$ and the number of active devices $M' = |\mathcal{M}'|$, optimize the power allocation under the sum-power constraint $\sum_{k\in\mathcal{M}'}\tilde P_k \le M'P$ so as to minimize the sum local-gradient deviation.

For Step 2, given $\mathcal{M}'$ and (30), the sum local-gradient deviation is
$$\sum_{k\in\mathcal{M}'}G^\star_{\mathrm{lo},k} = \frac{W}{(T^{\mathrm{cmp}})^{2/3}}\sum_{k\in\mathcal{M}'}\frac{\sigma_k^2 C_k^{1/3}}{\big(E_k^{\mathrm{cmp}}\big)^{1/3}}. \quad (45)$$
By substituting $\{E_k^{\mathrm{cmp}}\}$ into (45), the problem of optimal power control can be formulated as
$$(\mathbf{P3})\quad \min_{\{\tilde P_k\}}\;\frac{W}{(T^{\mathrm{cmp}})^{2/3}}\sum_{k\in\mathcal{M}'}\frac{\sigma_k^2 C_k^{1/3}}{\Big(\frac{\rho\|\tilde h_k\|^2\tilde P_k T^{\mathrm{cmp}}}{r_k^\alpha}-\frac{r_k^\alpha\varphi(T^{\mathrm{cmm}})}{\|h_k\|^2}\Big)^{1/3}} \quad\text{s.t.}\quad \sum_{k\in\mathcal{M}'}\tilde P_k \le M'P.$$
By a straightforward application of the KKT conditions, the optimal power-allocation policy is derived in closed form as
$$\tilde P^\star_k = \frac{r_k^{2\alpha}\varphi(T^{\mathrm{cmm}})}{\rho\|\tilde h_k\|^2\|h_k\|^2 T^{\mathrm{cmp}}} + \frac{1}{\theta}\cdot\frac{r_k^{\alpha/4}\sigma_k^{3/2}C_k^{1/4}}{\big(\rho\|\tilde h_k\|^2 T^{\mathrm{cmp}}\big)^{1/4}}\,(P-\varsigma), \quad \forall k\in\mathcal{M}', \quad (46)$$
where
$$\theta = \frac{1}{M'}\sum_{k\in\mathcal{M}'}\frac{r_k^{\alpha/4}\sigma_k^{3/2}C_k^{1/4}}{\big(\rho\|\tilde h_k\|^2 T^{\mathrm{cmp}}\big)^{1/4}} \quad\text{and}\quad \varsigma = \frac{1}{M'}\sum_{k\in\mathcal{M}'}\frac{r_k^{2\alpha}\varphi(T^{\mathrm{cmm}})}{\rho\|\tilde h_k\|^2\|h_k\|^2 T^{\mathrm{cmp}}}, \quad (47)$$
with the remaining algebra being routine
One can observe from (46) that the optimal power allocated forWPT sums two components. The first component, which supports gradient uploading, is inverselyproportional to the gain of the close-loop channel cascading downlink for WPT and uplink forgradient uploading. The other component, which supports local computation, depends only onthe WPT link and is a monotone decreasing function of the channel again of the WPT link.VI. E XPERIMENTAL R ESULTS
A. Experimental Settings
The default settings are as follows. The edge server is equipped with multiple antennas and the cell radius is set as $R = 100$ m. There are $K = 30$ edge devices uniformly distributed in the cell, each of which is allocated an uplink bandwidth of $B = 1$ MHz. The noise power spectral density $N_0$ and the uplink and WPT path-loss exponents, $\alpha$ and $\beta$, are fixed at default values, with the values used in each experiment stated in the corresponding figures; for the WPT path-loss model, $\nu = 1$. The processor coefficients $\{C_k\}$ are chosen by uniformly sampling a fixed set of values [in Watt$\cdot$(MFLOPs/s)$^{-3}$]. The learning task aims at training a CNN model to classify handwritten digits using the well-known MNIST dataset. For a non-i.i.d. data distribution, we first arrange the 60,000 data samples according to their labels, follow the sample sequence to divide the dataset into 60 subsets each of size 1,000, and assign each of the 30 devices 2 data subsets. The classifier model is implemented using a 6-layer CNN, which consists of two convolutional layers with ReLU activation, each followed by max pooling, a fully-connected layer with ReLU activation, and a final softmax output layer; the total number of parameters is denoted $q$ and the per-sample computation workload is denoted $W$ FLOPs. Furthermore, we suppose that each parameter of the model gradient is quantized into $Q = 16$ bits, and as a result, the transmission overhead in one round is $qQ$ bits. We fix the number of rounds as $N = 500$ and evaluate the learning performance in terms of the average gradient norm (over rounds) and the test accuracy.

B. Performance of WP-FEEL with Beacon-WPT
The curves of learning performance versus the spatial-energy density $\lambda_{\mathrm{energy}}$ provided by the power-beacon network are plotted in Fig. 3 for a varying path-loss exponent $\beta$ of the WPT links. Both experimental and analytical results are presented. Several observations can be made. As $\lambda_{\mathrm{energy}}$ grows, the increase of transferred energy allows more devices to participate in learning, or equivalently, more distributed data to be exploited for model training. Consequently, one can observe that both the average gradient norm and the test accuracy saturate as they converge to their ground truths. Next, before convergence, the average gradient norms from analysis and experiments follow the same scaling laws, which validates the analytical model and the results in Theorem 1 and its corollary. Last, one can observe that a larger value of $\beta$, and hence a smaller path loss, results in better performance, as more energy can be transferred from beacons to devices.

Figure 3. Effects of the spatial-energy density on the performance of a WP-FEEL system with beacon-WPT: (a) average gradient norm and (b) test accuracy, for $\beta = 3$ and $\beta = 5$ (simulation and analysis).

Define the computation-energy rate of a device as the number of FLOPs computed by its processor per unit of energy consumption. For ease of exposition, consider the case of uniform computation-energy rates for all devices. The curves of learning performance versus the computation-energy rate are plotted in Fig. 4 for a varying noise power spectral density $N_0$. Several observations can be made. Both the analytical and experimental results are presented in the figure. Their discrepancy arises from some mismatch between the general analytical model specified in the common Assumptions 1–4 and the specific dataset (i.e., MNIST) used in the experiments. The mismatch is observed to be smaller for a smaller $N_0$ due to the averaging effect of more devices and more data involved in learning. Next, aligned with intuition, reducing $N_0$ improves the learning performance by activating more devices as well as reducing the communication-energy consumption; thereby more energy is available for local gradient estimation and its accuracy improves. Next, given fixed harvested energy per device, one can observe degradation of the learning performance as the increasing computation-energy rate reduces the mini-batch size, the number of active devices, and the energy available for communication.

Figure 4. Effects of the computation-energy consumption and channel noise on the performance of a WP-FEEL system with beacon-WPT: (a) average gradient norm and (b) test accuracy, for $N_0 = -90$ and $-70$ dBm/Hz (simulation and analysis).

Figure 5. Learning-performance comparison between beacon-WPT and server-WPT for varying beacon density $\lambda_{\mathrm{pb}}$ and path-loss exponent $\beta$: (a) cell radius $R = 20$ m and (b) cell radius $R = 10$ m.

C. Comparison between Beacon-WPT and Server-WPT
Based on the experiments, the learning performance in the scenarios of beacon-WPT and server-WPT is compared in Fig. 5 for two different cell sizes. For the purpose of comparison, the transmission powers of the power-beacons and the server are equalized, $\bar P = P$, while the beacon density, $\lambda_{\mathrm{pb}}$, and the path-loss exponent of the beacon-WPT links, $\beta$, are varied. For a relatively large cell ($R = 20$ m), it can be observed that beacon-WPT always outperforms server-WPT in the considered ranges of settings, as the latter, with only a single power source, suffers from a low WPT efficiency due to severe path loss. This suggests the need for deploying power-beacons to power devices if FEEL is deployed over a large area involving many devices. On the other hand, for a relatively small cell ($R = 10$ m), server-WPT has a higher WPT efficiency and can outperform beacon-WPT when the beacons are sparse and/or the path-loss exponent $\beta$ is large.

VII. CONCLUDING REMARKS
In this paper, we have proposed the application of WPT to a FEEL system as a solution to the practical issue of high energy consumption at devices. To study the performance of the resultant WP-FEEL system, we have analyzed the optimal learning-WPT tradeoffs for both the scenarios of beacon-WPT and server-WPT. The results contribute useful insights and algorithms for designing and deploying WP-FEEL systems. This first study of WP-FEEL opens several directions for further research, including the application of WPT to support the implementation of other edge-learning frameworks (e.g., parameter server and reinforcement learning), the use of more complex wireless techniques (e.g., over-the-air aggregation and radio resource management), and the design of multi-cell networks.

APPENDIX
A. Proof of Proposition 1
According to the update rule in (10) and Assumption 1, we have
$$F(w^{(i+1)}) - F(w^{(i)}) \le \big\langle\nabla F(w^{(i)}),\, w^{(i+1)}-w^{(i)}\big\rangle + \frac{\mu}{2}\|w^{(i+1)}-w^{(i)}\|^2 = -\eta\big\langle\nabla F(w^{(i)}),\, g^{(i)}\big\rangle + \frac{\mu\eta^2}{2}\|g^{(i)}\|^2. \quad (48)$$
Using $\|g^{(i)}\|^2 = \|g^{(i)}-\nabla F(w^{(i)})\|^2 - \|\nabla F(w^{(i)})\|^2 + 2\langle\nabla F(w^{(i)}), g^{(i)}\rangle$, we can derive
$$F(w^{(i+1)}) - F(w^{(i)}) \le \frac{\mu\eta^2}{2}\big(\|g^{(i)}-\nabla F(w^{(i)})\|^2 - \|\nabla F(w^{(i)})\|^2\big) + \big(\mu\eta^2-\eta\big)\big\langle\nabla F(w^{(i)}), g^{(i)}\big\rangle$$
$$= \frac{\mu\eta^2}{2}\|g^{(i)}-\nabla F(w^{(i)})\|^2 + \Big(\frac{\mu\eta^2}{2}-\eta\Big)\|\nabla F(w^{(i)})\|^2 - (1-\mu\eta)\,\eta\,\big\langle\nabla F(w^{(i)}),\, g^{(i)}-\nabla F(w^{(i)})\big\rangle. \quad (49)$$
The third term on the right-hand side of (49) can be upper bounded as
$$-\big\langle\nabla F(w^{(i)}),\, g^{(i)}-\nabla F(w^{(i)})\big\rangle \le \frac{1}{2}\big(\|\nabla F(w^{(i)})\|^2 + \|g^{(i)}-\nabla F(w^{(i)})\|^2\big). \quad (50)$$
Given $0 < \eta \le \frac{1}{\mu}$, we substitute (50) into (49) and rearrange the result, yielding
$$\|\nabla F(w^{(i)})\|^2 \le \frac{2\big(F(w^{(i)}) - F(w^{(i+1)})\big)}{\eta} + \|g^{(i)}-\nabla F(w^{(i)})\|^2. \quad (51)$$
It follows that
$$\mathbb{E}\Big[\frac{1}{N}\sum_{i=0}^{N-1}\|\nabla F(w^{(i)})\|^2\Big] \le \frac{2\big(F(w^{(0)}) - \mathbb{E}\big[F(w^{(N)})\big]\big)}{\eta N} + \frac{1}{N}\sum_{i=0}^{N-1}\mathbb{E}\big[\|g^{(i)}-\nabla F(w^{(i)})\|^2\big] \le \frac{2\big(F(w^{(0)}) - F^*\big)}{\eta N} + \frac{1}{N}\sum_{i=0}^{N-1}\mathbb{E}\big[\|g^{(i)}-\nabla F(w^{(i)})\|^2\big], \quad (52)$$
where the second inequality in (52) follows from Assumption 3. This completes the proof.

B. Proof of Lemma 3
The aggregated global gradient can be written using indicator functions as
$$g = \frac{1}{M}\sum_{k\in\mathcal{M}}g_k = \frac{1}{M}\sum_{k=1}^K\mathbb{I}_{k\in\mathcal{M}}\,g_k\;\text{ if }M>0, \quad\text{and}\quad g = 0\;\text{ otherwise}. \quad (53)$$
For tractability, we introduce the following auxiliary gradient $\tilde g$:
$$\tilde g = \frac{1}{K}\sum_{k=1}^K\mathbb{I}_{k\in\mathcal{M}}\,g_k, \quad\text{where } \mathbb{I}_{k\in\mathcal{M}}\,g_k = 0 \text{ for } k\in\mathcal{K}\setminus\mathcal{M}. \quad (54)$$
Then we can derive the following upper bound on $G_{\mathrm{gl}}$ using the auxiliary gradient:
$$G_{\mathrm{gl}} = \mathbb{E}_{\mathcal{M}}\big\{\mathbb{E}\big[\|\tilde g - \nabla F(w) + g - \tilde g\|^2\big]\big\} \le 2\,\mathbb{E}_{\mathcal{M}}\Big\{\underbrace{\mathbb{E}\big[\|\tilde g - \nabla F(w)\|^2\big]}_{\mathrm{(a)}} + \underbrace{\mathbb{E}\big[\|g - \tilde g\|^2\big]}_{\mathrm{(b)}}\Big\}. \quad (55)$$
In the following, we bound the terms (a) and (b) defined in (55), respectively.

1) Since $\nabla F(w) = \frac{1}{K}\sum_{k=1}^K\nabla F_k(w)$, we can first decompose term (a) as follows:
$$\mathrm{(a)} = \mathbb{E}\big[\|\tilde g - \nabla F(w)\|^2\big] = \frac{1}{K^2}\sum_{k=1}^K\sum_{\ell=1}^K\mathbb{E}\big[\big\langle\mathbb{I}_{k\in\mathcal{M}}g_k - \nabla F_k(w),\, \mathbb{I}_{\ell\in\mathcal{M}}g_\ell - \nabla F_\ell(w)\big\rangle\big]$$
$$= \frac{1}{K^2}\Bigg(\underbrace{\sum_{k=1}^K\mathbb{E}\big[\|\mathbb{I}_{k\in\mathcal{M}}g_k - \nabla F_k(w)\|^2\big]}_{\mathrm{(a1)}} + \underbrace{\sum_{k\ne\ell}\mathbb{E}\big[\big\langle\mathbb{I}_{k\in\mathcal{M}}g_k - \nabla F_k(w),\, \mathbb{I}_{\ell\in\mathcal{M}}g_\ell - \nabla F_\ell(w)\big\rangle\big]}_{\mathrm{(a2)}}\Bigg).$$
Next, we aim at finding upper bounds on the terms (a1) and (a2), respectively.

• For term (a1), we can bound it as follows:
$$\mathrm{(a1)} = \sum_{k=1}^K\mathbb{I}_{k\in\mathcal{M}}\,\mathbb{E}\big[\|g_k - \nabla F_k(w)\|^2\big] + \sum_{k=1}^K\mathbb{I}_{k\in\mathcal{K}\setminus\mathcal{M}}\,\|\nabla F_k(w)\|^2 \le \sum_{k\in\mathcal{M}}G_{\mathrm{lo},k} + (K-M)\,\Phi^2, \quad (56)$$
where the last step uses Jensen's inequality: $\|\nabla F_k(w)\|^2 = \|\mathbb{E}[g_k]\|^2 \le \big(\mathbb{E}[\|g_k\|]\big)^2 \le \Phi^2$.
Then,
$$\mathbb{E}_M[\mathrm{(a1)}] \le \mathbb{E}\Big[\sum_{k\in\mathcal{M}}G_{\mathrm{lo},k}\Big] + \big(K - \mathbb{E}[M]\big)\Phi^2. \quad (57)$$

• For term (a2), we can first divide $\mathbb{E}\big[\langle\mathbb{I}_{k\in\mathcal{M}}g_k - \nabla F_k(w),\, \mathbb{I}_{\ell\in\mathcal{M}}g_\ell - \nabla F_\ell(w)\rangle\big]$ into three cases ($k\ne\ell$):

i) Case 1: $k\in\mathcal{M}$ and $\ell\in\mathcal{M}$. Given Assumption 2, $g_k$ and $g_\ell$ are uncorrelated, so
$$\mathbb{E}\big[\langle\cdots\rangle\big] = \mathbb{E}\big[(g_k - \nabla F_k(w))^T(g_\ell - \nabla F_\ell(w))\big] = \mathrm{tr}\{\mathrm{cov}(g_k, g_\ell)\} = 0. \quad (58)$$

ii) Case 2: $k\in\mathcal{M}$ but $\ell\notin\mathcal{M}$ (or $k\notin\mathcal{M}$ but $\ell\in\mathcal{M}$). Given Assumption 2, $g_k$ is an unbiased estimate of $\nabla F_k(w)$, resulting in
$$\mathbb{E}\big[\langle\cdots\rangle\big] = -\mathbb{E}\big[(g_k - \nabla F_k(w))^T\nabla F_\ell(w)\big] = 0. \quad (59)$$

iii) Case 3: $k\notin\mathcal{M}$ and $\ell\notin\mathcal{M}$. When both devices $k$ and $\ell$ are in outage, it holds that
$$\mathbb{E}\big[\langle\cdots\rangle\big] = \nabla F_k(w)^T\nabla F_\ell(w). \quad (60)$$

Combining the above three cases, we can bound term (a2) as follows:
$$\mathrm{(a2)} = \sum_{k\ne\ell}\mathbb{I}_{k\notin\mathcal{M}}\,\mathbb{I}_{\ell\notin\mathcal{M}}\,\langle\nabla F_k(w),\nabla F_\ell(w)\rangle \le \sum_{k\ne\ell}\mathbb{I}_{k\notin\mathcal{M}}\,\mathbb{I}_{\ell\notin\mathcal{M}}\,\|\nabla F_k(w)\|\|\nabla F_\ell(w)\| \le \sum_{k\ne\ell}\mathbb{I}_{k\notin\mathcal{M}}\,\mathbb{I}_{\ell\notin\mathcal{M}}\,\Phi^2. \quad (61)$$
Then, taking the expectation over $\mathcal{M}$, we can obtain
$$\mathbb{E}_M[\mathrm{(a2)}] \le \mathbb{E}\Big[\sum_{k\ne\ell}\mathbb{I}_{k\notin\mathcal{M}}\,\mathbb{I}_{\ell\notin\mathcal{M}}\Big]\Phi^2 \le (K^2 - K)\,P_{\mathrm{out}}\,\Phi^2. \quad (62)$$
In summary, the expected term (a) in (55) can be upper bounded by
$$\mathbb{E}_M[\mathrm{(a)}] \le \frac{1}{K^2}\,\mathbb{E}\Big[\sum_{k\in\mathcal{M}}G_{\mathrm{lo},k}\Big] + \frac{K - \mathbb{E}[M]}{K^2}\Phi^2 + \frac{K^2 - K}{K^2}P_{\mathrm{out}}\Phi^2. \quad (63)$$

2) For term (b) in (55), we can derive the upper bound as follows:
$$\mathrm{(b)} = \mathbb{E}\big[\|g - \tilde g\|^2\big] = \mathbb{I}_{M>0}\,\mathbb{E}\Big[\Big\|\Big(\frac{1}{M}-\frac{1}{K}\Big)\sum_{k\in\mathcal{M}}g_k\Big\|^2\Big] + \mathbb{I}_{M=0}\times 0 \le \mathbb{I}_{M>0}\Big(\frac{1}{M}-\frac{1}{K}\Big)^2\sum_{k\in\mathcal{M}}\mathbb{E}\big[\|g_k\|^2\big] \le \mathbb{I}_{M>0}\Big(\frac{1}{M}+\frac{M}{K^2}-\frac{2}{K}\Big)\Phi^2, \quad (64)$$
where we note that, when $M=0$, both $g$ and $\tilde g$ are zero, and thus $\mathbb{E}[\|g-\tilde g\|^2\,|\,M=0]=0$. Furthermore, considering $\mathbb{E}[M] = \Pr(M>0)\,\mathbb{E}[M|M>0] + \Pr(M=0)\,\mathbb{E}[M|M=0] = \Pr(M>0)\,\mathbb{E}[M|M>0]$ and $\Pr(M>0) = 1 - P_{\mathrm{out}}^K$, we can bound the expectation of (b) as
$$\mathbb{E}_M[\mathrm{(b)}] \le \Pr(M>0)\,\mathbb{E}\Big[\frac{1}{M}+\frac{M}{K^2}-\frac{2}{K}\,\Big|\,M>0\Big]\Phi^2 = \Big(\big(1-P_{\mathrm{out}}^K\big)\,\mathbb{E}\Big[\frac{1}{M}\,\Big|\,M>0\Big] + \frac{\mathbb{E}[M]}{K^2} - \frac{2\big(1-P_{\mathrm{out}}^K\big)}{K}\Big)\Phi^2. \quad (65)$$
Finally, substituting the results (63) and (65) into (55), we have
$$G_{\mathrm{gl}} \le \frac{2}{K^2}\,\mathbb{E}\Big[\sum_{k\in\mathcal{M}}G_{\mathrm{lo},k}\Big] + 2\big(1-P_{\mathrm{out}}^K\big)\Big(\mathbb{E}\Big[\frac{1}{M}\,\Big|\,M>0\Big]-\frac{1}{K}\Big)\Phi^2 + 2\Big(\frac{P_{\mathrm{out}}^K - P_{\mathrm{out}}}{K} + P_{\mathrm{out}}\Big)\Phi^2$$
$$\le \frac{2}{K^2}\,\mathbb{E}\Big[\sum_{k\in\mathcal{M}}G_{\mathrm{lo},k}\Big] + 2\big(1-P_{\mathrm{out}}^K\big)\Big(\mathbb{E}\Big[\frac{1}{M}\,\Big|\,M>0\Big]-\frac{1}{K}\Big)\Phi^2 + 2P_{\mathrm{out}}\Phi^2, \quad (66)$$
where we note that $P_{\mathrm{out}}^K - P_{\mathrm{out}} \le 0$ since $K\ge 1$. This completes the proof.

C. Proof of Lemma 1
According to Definition 3, in the $i$-th round, device $k$ encounters a computation-outage event if the condition $E_k^{\mathrm{cmm}}(r_k^{(i)}, h_k^{(i)}) \ge \bar E$ holds, which is equivalent to $R^\alpha\|h_k^{(i)}\|^2 \le \big(r_k^{(i)}\big)^\alpha\,\xi$. Define the domain $\Xi = \{(r,h)\in\mathbb{R}^2 : R^\alpha h \le r^\alpha\xi;\; h\ge 0;\; 0\le r\le R\}$ for inactive devices, where $\xi$ has the same definition as in (25). Since $h_k^{(i)}$ and $r_k^{(i)}$ are independent, we know $f_{(r_k^{(i)},\,\|h_k^{(i)}\|^2)}(r,h) = f_{\|h_k^{(i)}\|^2}(h)\,f_{r_k^{(i)}}(r)$. Then, we can derive the outage probability:
$$P_{\mathrm{out}} = \int\!\!\int_\Xi f_{\|h_k^{(i)}\|^2}(h)\,f_{r_k^{(i)}}(r)\,dr\,dh = \int_0^R\frac{2r}{R^2}\int_0^{(\frac{r}{R})^\alpha\xi}\frac{h^{L-1}e^{-h}}{\Gamma(L)}\,dh\,dr = \frac{2}{R^2\,\Gamma(L)}\int_0^R\gamma\Big(L, \Big(\frac{r}{R}\Big)^\alpha\xi\Big)\,r\,dr = \frac{\gamma(L,\xi) - \xi^{-2/\alpha}\,\gamma\big(L+\frac{2}{\alpha},\,\xi\big)}{\Gamma(L)}. \quad (67)$$
This completes the proof.

D. Proof of Lemma 5
The aggregated local gradient deviations come from the active devices. Define the domain $\Theta = \{(r,h)\in\mathbb{R}^2 : R^\alpha h > r^\alpha\xi;\; h\ge 0;\; 0\le r\le R\}$ for active devices, where $\xi$ has the same definition as in (25). By a change of variables, we introduce a new integration variable $x\triangleq\big(\frac{r}{R}\big)^\alpha\xi$, so that $r = R\,\xi^{-1/\alpha}x^{1/\alpha}$ and $dr = \frac{R}{\alpha}\,\xi^{-1/\alpha}x^{1/\alpha-1}\,dx$. Accordingly, the integration domain becomes $\Theta = \{(x,h)\in\mathbb{R}^2 : h > x;\; 0\le x\le\xi\}$. Then we can obtain
$$\mathbb{E}\big[\mathbb{I}_{k\in\mathcal{M}}\,G^\star_{\mathrm{lo},k}\big] = \frac{W\sigma_k^2 C_k^{1/3}\,\xi^{1/3}}{(T^{\mathrm{cmp}})^{2/3}\big(R^\alpha\varphi(T^{\mathrm{cmm}})\big)^{1/3}}\int\!\!\int_\Theta\bigg(\frac{h}{h-(\frac{r}{R})^\alpha\xi}\bigg)^{1/3}\frac{h^{L-1}e^{-h}}{\Gamma(L)}\,\frac{2r}{R^2}\,dr\,dh = \frac{2W\sigma_k^2 C_k^{1/3}\,\xi^{1/3-2/\alpha}}{\alpha\,(T^{\mathrm{cmp}})^{2/3}\big(R^\alpha\varphi(T^{\mathrm{cmm}})\big)^{1/3}\Gamma(L)}\underbrace{\int\!\!\int_\Theta h^{L-\frac{2}{3}}e^{-h}(h-x)^{-\frac{1}{3}}x^{\frac{2}{\alpha}-1}\,dx\,dh}_{\mathrm{(c)}}. \quad (68)$$
The integral (c) defined in (68) can be decomposed into two terms as follows:
$$\mathrm{(c)} = \underbrace{\int_0^\xi\int_0^h h^{L-\frac{2}{3}}e^{-h}(h-x)^{-\frac{1}{3}}x^{\frac{2}{\alpha}-1}\,dx\,dh}_{\mathrm{(c1)}} + \underbrace{\int_\xi^\infty\int_0^\xi h^{L-\frac{2}{3}}e^{-h}(h-x)^{-\frac{1}{3}}x^{\frac{2}{\alpha}-1}\,dx\,dh}_{\mathrm{(c2)}}. \quad (69)$$
1) For term (c1), we can derive the result as follows:
$$\mathrm{(c1)} = B\Big(\frac{2}{3},\frac{2}{\alpha}\Big)\int_0^\xi h^{L+\frac{2}{\alpha}-1}e^{-h}\,dh = B\Big(\frac{2}{3},\frac{2}{\alpha}\Big)\,\gamma\Big(L+\frac{2}{\alpha},\,\xi\Big). \quad (70)$$
2) For term (c2), we can find an upper bound as follows:
$$\mathrm{(c2)} = \int_\xi^\infty h^{L-1}e^{-h}\int_0^\xi\Big(1-\frac{x}{h}\Big)^{-\frac{1}{3}}x^{\frac{2}{\alpha}-1}\,dx\,dh \le \int_\xi^\infty h^{L-1}e^{-h}\,dh\int_0^\xi\Big(1-\frac{x}{\xi}\Big)^{-\frac{1}{3}}x^{\frac{2}{\alpha}-1}\,dx = B\Big(\frac{2}{3},\frac{2}{\alpha}\Big)\,\xi^{\frac{2}{\alpha}}\,\Gamma(L,\xi). \quad (71)$$
Substituting the results (70) and (71) into (68), we can derive the following upper bound:
$$\mathbb{E}\big[\mathbb{I}_{k\in\mathcal{M}}\,G^\star_{\mathrm{lo},k}\big] \le \frac{2W\sigma_k^2 C_k^{1/3}\,B\big(\frac{2}{3},\frac{2}{\alpha}\big)\,\xi^{1/3}}{\alpha\,(T^{\mathrm{cmp}})^{2/3}\big(R^\alpha\varphi(T^{\mathrm{cmm}})\big)^{1/3}\Gamma(L)}\Big(\xi^{-\frac{2}{\alpha}}\,\gamma\Big(L+\frac{2}{\alpha},\,\xi\Big) + \Gamma(L,\xi)\Big). \quad (72)$$
Expanding $\mathbb{E}[\mathbb{I}_{k\in\mathcal{M}}\,G^\star_{\mathrm{lo},k}]$ and comparing the right-hand side with the expression for $P_{\mathrm{out}}$ in (67) — noting that $\Gamma(L,\xi) + \xi^{-2/\alpha}\gamma(L+\frac{2}{\alpha},\xi) = \Gamma(L)(1-P_{\mathrm{out}})$ — we have
$$\Pr(k\in\mathcal{M})\,\mathbb{E}\big[G^\star_{\mathrm{lo},k}\,|\,k\in\mathcal{M}\big] \le \frac{2W\sigma_k^2 C_k^{1/3}\,B\big(\frac{2}{3},\frac{2}{\alpha}\big)\,\xi^{1/3}}{\alpha\,(T^{\mathrm{cmp}})^{2/3}\big(R^\alpha\varphi(T^{\mathrm{cmm}})\big)^{1/3}}\,\big(1-P_{\mathrm{out}}\big). \quad (73)$$
Note that $\Pr(k\in\mathcal{M}) = 1 - P_{\mathrm{out}}$ and $\xi^{1/3}\big(R^\alpha\varphi(T^{\mathrm{cmm}})\big)^{-1/3} = \bar E^{-1/3}$, which yields (32) and completes the proof.

E. Proof of Corollary 3
E. Proof of Corollary 3

Applying the second mean value theorem for definite integrals, we can derive (with $\tau \ll 1$)
$$P'_{\mathrm{out}} = \int_{0}^{\tau}\frac{K_0(2\sqrt{x})}{\ln x}\cdot\frac{x^{L-1}\big[1-(x/\tau)^{1/\alpha}\big]\ln x}{\Gamma(L)}\,\mathrm{d}x = \left[\lim_{x\to 0^{+}}\frac{K_0(2\sqrt{x})}{\ln x}\right]\int_{0}^{\tau'}\frac{x^{L-1}\big[1-(x/\tau)^{1/\alpha}\big]\ln x}{\Gamma(L)}\,\mathrm{d}x \le -\frac{1}{2}\int_{0}^{\tau}\frac{x^{L-1}\big[1-(x/\tau)^{1/\alpha}\big]\ln x}{\Gamma(L)}\,\mathrm{d}x = \frac{\tau^{L}\big(1+2\alpha L - L(1+\alpha L)\ln\tau\big)}{2\,\Gamma(L)\,L^{2}(1+\alpha L)^{2}}, \quad (74)$$
where $\frac{K_0(2\sqrt{x})}{\ln x}$ is a negative, monotone increasing function in a right neighborhood of $0$ with $\lim_{x\to 0^{+}}\frac{K_0(2\sqrt{x})}{\ln x} = -\frac{1}{2}$, and the number $\tau' \in (0,\tau]$. This completes the proof.

REFERENCES

[1] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang, "Edge intelligence: Paving the last mile of artificial intelligence with edge computing," Proc. IEEE, vol. 107, no. 8, pp. 1738–1762, 2019.
[2] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, "Toward an intelligent edge: Wireless communication meets machine learning," IEEE Commun. Mag., vol. 58, no. 1, pp. 19–25, 2020.
[3] W. Y. B. Lim, N. C. Luong, D. T. Hoang, Y. Jiao, Y. C. Liang, Q. Yang, D. Niyato, and C. Miao, "Federated learning in mobile edge networks: A comprehensive survey," IEEE Commun. Surveys Tuts., vol. 22, no. 3, pp. 2031–2063, 2020.
[4] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource-constrained edge computing systems," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, 2019.
[5] M. M. Amiri and D. Gündüz, "Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air," IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, 2020.
[6] G. Zhu, Y. Wang, and K. Huang, "Broadband analog aggregation for low-latency federated edge learning," IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, 2020.
[7] K. Yang, T. Jiang, Y. Shi, and Z. Ding, "Federated learning via over-the-air computation," IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, 2020.
[8] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, 2021.
[9] J. Ren, G. Yu, and G. Ding, "Accelerating DNN training in wireless federated edge learning systems," IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 219–232, 2021.
[10] D. Wen, M. Bennis, and K. Huang, "Joint parameter-and-bandwidth allocation for improving the efficiency of partitioned edge learning," IEEE Trans. Wireless Commun., vol. 19, no. 12, pp. 8272–8286, 2020.
[11] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, "Scheduling policies for federated learning in wireless networks," IEEE Trans. Commun., vol. 68, no. 1, pp. 317–333, 2020.
[12] Y. Du, S. Yang, and K. Huang, "High-dimensional stochastic gradient quantization for communication-efficient edge learning," IEEE Trans. Signal Process., vol. 68, pp. 2128–2142, 2020.
[13] Y. Sun, S. Zhou, and D. Gündüz, "Energy-aware analog aggregation for federated learning with redundant data," in Proc. IEEE Int. Conf. Commun. (ICC), Dublin, Ireland, Jun. 7–11, 2020.
[14] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, "Energy efficient federated learning over wireless communication networks," to appear in IEEE Trans. Wireless Commun., 2020.
[15] X. Mo and J. Xu, "Energy-efficient federated edge learning with joint communication and computation design," [Online] https://arxiv.org/pdf/2003.00199.pdf, 2020.
[16] Q. Zeng, Y. Du, K. Huang, and K. K. Leung, "Energy-efficient resource management for federated edge learning with CPU-GPU heterogeneous computing," [Online] https://arxiv.org/pdf/2007.07122.pdf, 2020.
[17] B. Clerckx, R. Zhang, R. Schober, D. W. K. Ng, D. I. Kim, and H. V. Poor, "Fundamentals of wireless information and power transfer: From RF energy harvester models to signal and system designs," IEEE J. Sel. Areas Commun., vol. 37, no. 1, pp. 4–33, 2019.
[18] B. Clerckx, K. Huang, L. R. Varshney, S. Ulukus, and M.-S. Alouini, "Wireless power transfer for future networks: Signal processing, machine learning, computing, and sensing," [Online] https://arxiv.org/pdf/2101.04810.pdf, 2021.
[19] K. Huang and X. Zhou, "Cutting the last wires for mobile communications by microwave power transfer," IEEE Commun. Mag., vol. 53, no. 6, pp. 86–93, 2015.
[20] R. Zhang and C. K. Ho, "MIMO broadcasting for simultaneous wireless information and power transfer," IEEE Trans. Wireless Commun., vol. 12, no. 5, pp. 1989–2001, 2013.
[21] H. Ju and R. Zhang, "Throughput maximization in wireless powered communication networks," IEEE Trans. Wireless Commun., vol. 13, no. 1, pp. 418–428, 2014.
[22] K. Huang and V. K. N. Lau, "Enabling wireless power transfer in cellular networks: Architecture, modeling and deployment," IEEE Trans. Wireless Commun., vol. 13, no. 2, pp. 902–912, 2014.
[23] Q. Wu, M. Tao, D. W. Kwan Ng, W. Chen, and R. Schober, "Energy-efficient resource allocation for wireless powered communication networks," IEEE Trans. Wireless Commun., vol. 15, no. 3, pp. 2312–2327, 2016.
[24] X. Zhou, R. Zhang, and C. K. Ho, "Wireless information and power transfer: Architecture design and rate-energy tradeoff," IEEE Trans. Commun., vol. 61, no. 11, pp. 4754–4767, 2013.
[25] G. Zhu, Y. Du, D. Gündüz, and K. Huang, "One-bit over-the-air aggregation for communication-efficient federated edge learning: Design and convergence analysis," to appear in IEEE Trans. Wireless Commun., 2020.
[26] H. Yu, S. Yang, and S. Zhu, "Parallel restarted SGD with faster convergence and less communication: Demystifying why model averaging works for deep learning," in Proc. AAAI Conf. Artif. Intell., Honolulu, USA, Jan. 27–Feb. 1, 2019.
[27] D. Basu, D. Data, C. Karakus, and S. Diggavi, "Qsparse-local-SGD: Distributed SGD with quantization, sparsification and local computations," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), Vancouver, Canada, Dec. 8–14, 2019.
[28] A. Koloskova, S. Stich, and M. Jaggi, "Decentralized stochastic optimization and gossip algorithms with compressed communication," in Proc. Int. Conf. Mach. Learn. (ICML), Long Beach, USA, Jun. 9–15, 2019.
[29] F. Zhou and G. Cong, "On the convergence properties of a k-step averaging stochastic gradient descent algorithm for nonconvex optimization," in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), Stockholm, Sweden, Jul. 13–19, 2018.
[30] F. Baccelli, B. Blaszczyszyn, and P. Muhlethaler, "An ALOHA protocol for multihop mobile wireless networks," IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 421–436, 2006.
[31] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. Int. Conf. Artif. Intell. Statist. (AISTATS), Fort Lauderdale, USA, Apr. 20–22, 2017.
[32] X. Zhang, X. Zhou, M. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," in Proc. IEEE/CVF Conf. Comput. Vision Pattern Recognit. (CVPR), Salt Lake City, USA, Jun. 18–23, 2018.
[33] C. Liu, J. Li, W. Huang, J. Rubio, E. Speight, and F. Lin, "Power-efficient time-sensitive mapping in heterogeneous systems," in Proc. Int. Conf. Parallel Archit. Compilation Tech. (PACT), Minneapolis, USA, Sep. 21–25, 2012.
[34] J. Bernstein, Y.-X. Wang, K. Azizzadenesheli, and A. Anandkumar, "signSGD: Compressed optimisation for non-convex problems," in Proc. Int. Conf. Mach. Learn. (ICML), Stockholm, Sweden, Jul. 10–15, 2018.
[35] F. Sattler, S. Wiedemann, K. R. Müller, and W. Samek, "Robust and communication-efficient federated learning from non-i.i.d. data," IEEE Trans. Neural Netw. Learn. Syst., vol. 31, no. 9, pp. 3400–3413, 2020.
[36] Z. Allen-Zhu, "Natasha 2: Faster non-convex optimization than SGD," in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS), Montreal, Canada, Dec. 2–8, 2018.
[37] L. Bottou, F. E. Curtis, and J. Nocedal, "Optimization methods for large-scale machine learning," SIAM Rev., vol. 60, no. 2, pp. 223–311, 2018.
[38] J. Wu, W. Hu, H. Xiong, J. Huan, V. Braverman, and Z. Zhu, "On the noisy gradient descent that generalizes as SGD," [Online] https://arxiv.org/pdf/1906.07405.pdf, 2019.
[39] F. F. Stephan, "The expected value and variance of the reciprocal and other negative powers of a positive Bernoullian variate,"