Federated Learning over Wireless Networks: Convergence Analysis and Resource Allocation
Canh T. Dinh, Nguyen H. Tran, Minh N. H. Nguyen, Choong Seon Hong, Wei Bao, Albert Y. Zomaya, Vincent Gramoli
Abstract—There is an increasing interest in a fast-growing machine learning technique called Federated Learning (FL), in which model training is distributed over mobile user equipment (UEs), exploiting the UEs' local computation and training data. Despite its advantages, such as preserving data privacy, FL still faces challenges of heterogeneity across the UEs' data and physical resources. To address these challenges, we first propose an FL algorithm that can handle heterogeneous UE data without further assumptions beyond strongly convex and smooth loss functions. We provide a convergence rate characterizing the trade-off between the local computation rounds of each UE to update its local model and the global communication rounds to update the FL global model. We then employ the proposed FL algorithm in wireless networks as a resource allocation optimization problem that captures the trade-off between the FL convergence wall clock time and the energy consumption of UEs with heterogeneous computing and power resources. Even though the wireless resource allocation problem of FL is non-convex, we exploit this problem's structure to decompose it into three sub-problems and analyze their closed-form solutions as well as insights into problem design. Finally, we illustrate the theoretical analysis of the new algorithm with TensorFlow experiments, and provide extensive numerical results for the wireless resource allocation sub-problems. The experimental results not only verify the theoretical convergence but also show that our proposed algorithm outperforms the vanilla FedAvg algorithm in terms of convergence rate and test accuracy.
Index Terms—Distributed Machine Learning, Federated Learning, Optimization Decomposition.
I. INTRODUCTION
C. Dinh, N. H. Tran, W. Bao, A. Y. Zomaya, and V. Gramoli are with the School of Computer Science, The University of Sydney, Sydney, NSW 2006, Australia (email: [email protected], {nguyen.tran, wei.bao, albert.zomaya, vincent.gramoli}@sydney.edu.au). M. N. H. Nguyen and C. S. Hong are with the Department of Computer Science and Engineering, Kyung Hee University, Korea (email: {minhnhn, cshong}@khu.ac.kr). A previous version of this paper was presented at IEEE INFOCOM 2019 [1].

The significant increase in the number of cutting-edge mobile and Internet of Things (IoT) devices has resulted in phenomenal growth of the data volume generated at the edge network. It has been predicted that by 2025 there will be 80 billion devices connected to the Internet, and the global data volume will reach 180 trillion gigabytes [2]. However, most of this data is privacy-sensitive in nature. Storing this data in data centers is not only risky but also costly in terms of communication. For example, location-based services such as the app Waze [3] can help users avoid heavy-traffic roads and thus reduce congestion. However, to use this application, users have to share their locations with the server, and it cannot be guaranteed that drivers' location data are kept safe. Besides, in order to suggest optimal routes for drivers, Waze collects a large amount of data, including every road driven, and transfers it to the data center. Transferring this amount of data is expensive in communication and requires drivers' devices to be connected to the Internet continuously.

In order to maintain the privacy of consumer data and reduce the communication cost, a new class of machine learning techniques is needed that shifts computation to the edge network, where data privacy can be maintained. One such popular technique is Federated Learning (FL) [4]. This technology allows users to collaboratively build a shared learning model while keeping all training data on their user equipment (UE). In particular, a UE computes updates to the current global model on its local training data, which are then aggregated and fed back by a central server, so that all UEs have access to the same global model when computing their next updates. This process is repeated until a target accuracy level of the learning model is reached. In this way, user data privacy is well protected because local training data are never shared, which distinguishes FL from conventional approaches to data acquisition, storage, and training.

There are several reasons why FL is attracting plenty of interest. Firstly, modern smart UEs can now handle the heavy computing tasks of intelligent applications, as they are armed with high-performance central processing units (CPUs), graphics processing units (GPUs), and integrated AI chips called neural processing units (e.g., the Snapdragon 845, Kirin 980, and Apple A12 Bionic [5]). Being equipped with the latest computing resources at the edge, model training can be performed locally, reducing the time spent uploading raw data to the data center. Secondly, the increase in storage capacity, as well as the plethora of sensors (e.g., cameras, microphones, GPS) in UEs, enables them to collect a wealth of data and store it locally. This facilitates unprecedented large-scale, flexible data collection and model training. With recent advances in edge computing, FL can be implemented in reality more easily. For example, a crowd of smart devices can proactively sense and collect data during the day, then jointly feed back and update the global model during the night, to improve the efficiency and accuracy for next-day usage. We envision that such an approach will boost a new generation of smart services, such as smart transportation, smart shopping, and smart hospitals.

Despite its promising benefits, FL comes with new challenges to tackle. On one hand, the number of UEs in FL can be large, and the data generated by the UEs can have diverse distributions [4]. Designing efficient algorithms that handle statistical heterogeneity with convergence guarantees is thus a priority question. Recently, several studies [4], [6], [7] have used de facto optimization algorithms such as Gradient Descent (GD) and Stochastic Gradient Descent (SGD) to enable devices' local updates in FL.
One of the most well-known methods, FedAvg [4], which uses averaged SGD updates, was experimentally shown to perform well in heterogeneous UE data settings; however, that work lacks a theoretical convergence analysis. By leveraging edge computing to enable FL, [7] proposed algorithms for heterogeneous FL networks using GD, with a bounded gradient divergence assumption to facilitate the convergence analysis. In another direction, the idea of allowing UEs to solve their local problems in FL with an arbitrary optimization algorithm, up to a local accuracy (or inexactness level), has attracted a number of researchers [8], [9]. While [8] uses primal-dual analysis to prove algorithm convergence under any distribution of data, the authors of [9] propose adding proximal terms to the local functions and use primal analysis for the convergence proof under a local dissimilarity assumption, an idea similar to bounding the gradient divergence between local and global loss functions.

While all of the above FL algorithms' complexities are measured in terms of the number of local and global update rounds (or iterations), the wall clock time of FL deployed in a wireless environment depends mainly on the number of UEs and their diverse characteristics, since UEs may have different hardware, energy budgets, and wireless connection status. Specifically, the total wall clock training time of FL includes not only the UE computation time (which depends on the UEs' CPU types and local data sizes) but also the communication time of all UEs (which depends on the UEs' channel gains, transmission powers, and local data sizes). Thus, to minimize the wall clock training time of FL, a careful resource allocation problem for FL over wireless networks must consider not only the FL parameters, such as the accuracy level governing the computation-communication trade-off, but also the allocation of UE resources, such as power and CPU cycles, with respect to wireless conditions. From the motivations above, our contributions are summarized as follows:

• We propose a new FL algorithm, named FEDL, requiring only the assumption of strongly convex and smooth loss functions. The crux of FEDL is a new local surrogate function, which is designed so that each UE can solve its local problem approximately up to a local accuracy level θ, and which is characterized by a hyper-learning rate η. Using primal convergence analysis, we show the convergence rate of FEDL by controlling η and θ, which also provides the trade-off between the number of local computation and global communication rounds. We then implement FEDL in TensorFlow to verify the theoretical findings on several federated datasets. The experimental results show that FEDL outperforms the vanilla FedAvg [4] in terms of training loss, convergence rate, and test accuracy.

• We propose a resource allocation problem for FEDL over wireless networks to capture the trade-off between the wall clock training time of FEDL and UE energy consumption, using the Pareto efficiency model. To handle the non-convexity of this problem, we exploit its special structure to decompose it into three sub-problems. The first two sub-problems relate to UE resource allocation over wireless networks; they are transformed into convex problems and solved separately, and their solutions are then used to obtain the solution to the third sub-problem, which gives the optimal η and θ of FEDL. We derive their closed-form solutions and characterize the impact of the Pareto-efficient controlling knob on the optimal (i) computation and communication training time, (ii) UE resource allocation, and (iii) hyper-learning rate and local accuracy. We also provide extensive numerical results to examine the impact of UE heterogeneity and the Pareto curve of UE energy cost versus wall clock training time.

The rest of this paper is organized as follows. Section II discusses related works. Section III presents the system model. Sections IV and V provide the proposed FL algorithm's analysis and resource allocation over wireless networks, respectively. Experimental performance of FEDL and numerical results for the resource allocation problem are provided in Sections VI and VII, respectively. Section VIII concludes our work.

II. RELATED WORKS
Due to Big Data applications and complex models such as deep learning, training machine learning models must be distributed over multiple machines, giving rise to research on decentralized machine learning [10]–[15]. However, most of the algorithms in these works are designed for machines having balanced and/or independent and identically distributed (i.i.d.) data. Realizing the lack of studies dealing with unbalanced and heterogeneous data distributions, an increasing number of researchers are studying FL, a state-of-the-art distributed machine learning technique [4], [7], [9], [16], [17]. This technique takes advantage of the involvement of a large number of devices where data are generated locally, which makes the data statistically heterogeneous in nature. As a result, designing algorithms with convergence guarantees for the global model becomes challenging. There are two main approaches to overcoming this problem.

The first approach is based on the de facto algorithm SGD with a fixed number of local iterations on each device [4]. Despite its feasibility, these studies lack convergence analysis. The work in [7], on the other hand, used GD with additional assumptions of Lipschitz local functions and bounded gradient divergence to prove algorithm convergence.

Another useful approach to tackling the heterogeneity challenge is to allow UEs to solve their primal problems approximately, up to a local accuracy threshold [9], [17]. These works show that the main benefit of this approximation approach is the flexibility it allows in the compromise between the number of rounds of local model updates and the communication to the server for global model updates. While the authors of [9] use primal convergence analysis with a bounded gradient divergence assumption and show that their algorithm applies to non-convex FL settings, [17] uses primal-dual convergence analysis, which is applicable only to convex FL problems.
From a different perspective, many researchers have recently focused on efficient communications between UEs and edge servers in FL-supported networks [1], [7], [18]–[22]. The work [7] proposes algorithms for FL in the context of edge networks with resource constraints. While several works [23], [24] study minimizing the communicated messages of each global iteration update by applying sparsification and quantization, utilizing these techniques in FL networks remains a challenge. For example, [18] uses gradient quantization, gradient sparsification, and error accumulation to compress gradient messages over the wireless multiple-access channel under the assumption of noiseless communication. A similar work [19] studies the same quantization technique to explore convergence guarantees with low-precision training. With the aim of determining the impact of wireless resources on minimizing the training time, [20] focuses on using cell-free massive MIMO to support FL. Contrary to most of these works, which make use of existing, standard FL algorithms, our work proposes a new one. The authors of [21] adopt and compare a number of scheduling policies in wireless networks. [22] considers the problem of simultaneously optimizing the completion time of FL together with the computation and transmission energy. [25] uses the FL approach to improve the experience of virtual reality (VR) for wireless users. Nevertheless, these works lack studies on unbalanced and heterogeneous data among UEs. We study how the computation and communication characteristics of UEs affect their energy consumption, training time, convergence, and accuracy level of FL, considering UEs heterogeneous in data size, channel gain, and computational and transmission power capabilities.

III. SYSTEM MODEL
We consider a wireless multi-user system consisting of one edge server and a set $\mathcal{N}$ of N UEs. Each participating UE n stores a local dataset $\mathcal{D}_n$, with its size denoted by D_n. The total data size is $D = \sum_{n=1}^{N} D_n$. In an example of the supervised learning setting, at UE n, $\mathcal{D}_n$ is the collection of data samples given as a set of input-output pairs $\{x_i, y_i\}_{i=1}^{D_n}$, where $x_i \in \mathbb{R}^d$ is an input sample vector with d features, and $y_i \in \mathbb{R}$ is the labeled output value for the sample $x_i$. The data can be generated through the usage of the UE, for example, via interactions with mobile apps.

In a typical learning problem, for a sample $\{x_i, y_i\}$ with input $x_i$ (e.g., the response times of various apps inside the UE), the task is to find the model parameter $w \in \mathbb{R}^d$ that characterizes the output $y_i$ (e.g., a label of the edge server load, such as high or low, in the next hours) with the loss function $f_i(w)$. The loss function on the dataset of UE n is defined as

$$F_n(w) := \frac{1}{D_n} \sum_{i \in \mathcal{D}_n} f_i(w).$$

Then the learning model is the minimizer of the following global loss function minimization problem

$$\min_{w \in \mathbb{R}^d} F(w) := \sum\nolimits_{n=1}^{N} p_n F_n(w), \qquad (1)$$

where $p_n := D_n / D$, ∀n.

Assumption 1. $F_n(\cdot)$ is L-smooth and β-strongly convex, ∀n, respectively, as follows, $\forall w, w' \in \mathbb{R}^d$:

$$F_n(w) \le F_n(w') + \langle \nabla F_n(w'), w - w' \rangle + \tfrac{L}{2} \|w - w'\|^2,$$
$$F_n(w) \ge F_n(w') + \langle \nabla F_n(w'), w - w' \rangle + \tfrac{\beta}{2} \|w - w'\|^2.$$

Throughout this paper, ⟨w, w'⟩ denotes the inner product of vectors w and w', and ‖·‖ is the Euclidean norm. We note that the strong convexity and smoothness of Assumption 1, also used in [7], can be found in a wide range of applications, such as the l2-regularized linear regression model with $f_i(w) = \frac{1}{2}(\langle x_i, w \rangle - y_i)^2 + \frac{\beta}{2}\|w\|^2$, $y_i \in \mathbb{R}$, and the l2-regularized logistic regression model with $f_i(w) = \log\big(1 + \exp(-y_i \langle x_i, w \rangle)\big) + \frac{\beta}{2}\|w\|^2$, $y_i \in \{-1, 1\}$. Note that we do not need the gradient divergence bound assumption as in [7, Definition 1] and [9, Assumption 1], which means our algorithm analysis applies to the general case of heterogeneous UEs' data distributions. We also denote by $\rho := L/\beta$ the condition number of $F_n(\cdot)$'s Hessian matrix.
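To make the objective in (1) concrete, the following minimal numpy sketch evaluates the l2-regularized logistic loss F_n and the data-size-weighted global loss F for the example above; the function and variable names are illustrative and not part of FEDL's specification.

```python
import numpy as np

def local_loss(w, X, y, beta):
    # F_n(w): l2-regularized logistic loss on one UE's data, labels y in {-1, +1}
    margins = -y * (X @ w)
    return np.mean(np.log1p(np.exp(margins))) + 0.5 * beta * (w @ w)

def global_loss(w, datasets, beta):
    # F(w) = sum_n p_n F_n(w) with data-size weights p_n = D_n / D, as in (1)
    sizes = np.array([len(y) for _, y in datasets], dtype=float)
    p = sizes / sizes.sum()
    return sum(p_n * local_loss(w, X, y, beta)
               for p_n, (X, y) in zip(p, datasets))
```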
IV. FEDERATED LEARNING ALGORITHM DESIGN

In this section, we propose a general FL framework for a distributed machine learning system, named FEDL, as presented in Algorithm 1. We briefly summarize the algorithm in what follows. To solve problem (1), FEDL uses an iterative approach that requires K_g global rounds for global model updates. In each global round, there are interactions between the UEs and the edge server. Specifically, a participating UE, in each computation phase, updates its local model using its local training data $\mathcal{D}_n$ as follows.

UEs update local models: In order to obtain the local model w_n^t at global round t, each UE n first receives the feedback information w^{t-1} and ∇F̄^{t-1} (which will be defined later in (4) and (5), respectively) from the server, and then minimizes the following surrogate function (line 3)

$$\min_{w \in \mathbb{R}^d} J_n^t(w) := F_n(w) + \big\langle \eta \nabla \bar{F}^{t-1} - \nabla F_n(w^{t-1}),\, w \big\rangle. \qquad (2)$$

One of the key ideas of FEDL is that UEs can solve (2) approximately to obtain an approximate solution w_n^t satisfying

$$\big\|\nabla J_n^t(w_n^t)\big\| \le \theta\, \big\|\nabla J_n^t(w^{t-1})\big\|, \quad \forall n, \qquad (3)$$

which is parametrized by a local accuracy θ ∈ (0, 1) common to all UEs. This local accuracy concept resembles the approximation factors in [8], [26]. Here θ = 0 means the local problem (2) must be solved optimally, and θ = 1 means no progress on the local problem, e.g., by setting w_n^t = w^{t-1}. The surrogate function J_n^t(·) in (2) is motivated by the Distributed Approximate NEwton (DANE) scheme proposed in [12]. However, DANE requires (i) the global gradient ∇F(w^{t-1}) (which is not available at the UEs or the server in the FL context), (ii) an additional proximal term (i.e., $\frac{\mu}{2}\|w - w^{t-1}\|^2$) and the assumption that the UEs' samples are i.i.d. (which does not hold in the FL context), and (iii) solving the local problem (2) exactly (i.e., θ = 0). In contrast, FEDL (i) uses the global gradient estimate ∇F̄^{t-1}, which the server can measure from the UEs' information, instead of the exact but unrealistic ∇F(w^{t-1}), (ii) avoids proximal terms to limit an additional controlling parameter (i.e., µ), and (iii) flexibly solves the local problem approximately by controlling θ.

Algorithm 1 FEDL
  Input: w^0, θ ∈ [0, 1), η > 0.
  for t = 1 to K_g do
    Computation: Each UE n receives w^{t-1} and ∇F̄^{t-1} from the server, and solves (2) in K_l rounds to achieve a θ-approximate solution w_n^t satisfying (3).
    Communication: Each UE n transmits w_n^t and ∇F_n(w_n^t) to the edge server.
    Aggregation and Feedback: The edge server updates the global model w^t and ∇F̄^t as in (4) and (5), respectively, and feeds them back to all UEs.

Compared to FedAvg [4], which solves the pure F_n(·) (without convergence analysis), and FedProx [9], which solves F_n(·) with an additional proximal term $\frac{\mu}{2}\|w - w^{t-1}\|^2$ (with convergence analysis using a gradient divergence bound), FEDL avoids the gradient divergence bound assumption and thus places no restriction on the heterogeneity of the UEs' data distributions. Furthermore, we have $\nabla J_n^t(w) = \nabla F_n(w) + \eta \nabla \bar{F}^{t-1} - \nabla F_n(w^{t-1})$, which includes both the local gradient and the global gradient estimate, weighted by a controllable parameter η. We will see later how η affects the convergence of FEDL.
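For concreteness, the following sketch shows one FEDL local computation phase under the stated assumptions: it builds the surrogate gradient ∇J_n^t(w) = ∇F_n(w) + η∇F̄^{t-1} − ∇F_n(w^{t-1}) and runs GD until the θ-approximation condition (3) holds. The fixed step size h and the iteration cap are illustrative choices, not prescribed by the algorithm.

```python
import numpy as np

def fedl_local_update(w_prev, grad_bar, grad_Fn, eta, theta, h, max_iter=10_000):
    """Approximately solve the surrogate problem (2) by gradient descent (7).

    w_prev   : global model w^{t-1} received from the server
    grad_bar : global gradient estimate (nabla F-bar)^{t-1} from the server
    grad_Fn  : callable returning the local gradient of F_n at a point
    """
    shift = eta * grad_bar - grad_Fn(w_prev)          # constant linear term of J_n^t
    grad_J = lambda w: grad_Fn(w) + shift             # gradient of the surrogate (2)
    target = theta * np.linalg.norm(grad_J(w_prev))   # right-hand side of (3)
    z = w_prev.copy()
    for _ in range(max_iter):
        g = grad_J(z)
        if np.linalg.norm(g) <= target:               # theta-approximation reached
            break
        z = z - h * g                                 # GD step (7) with fixed rate h
    return z
```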
Edge server updates global model: After receiving the local models w_n^t and gradients ∇F_n(w_n^t), ∀n, the edge server aggregates them as

$$w^t := \sum\nolimits_{n=1}^{N} p_n w_n^t, \qquad (4)$$
$$\nabla \bar{F}^t := \sum\nolimits_{n=1}^{N} p_n \nabla F_n(w_n^t), \qquad (5)$$

and then broadcasts w^t and ∇F̄^t to all UEs (line 5), which the participating UEs require to minimize their surrogates J_n^{t+1} in the next global round t+1. We see that the edge server never accesses the local data $\mathcal{D}_n$, ∀n, thus preserving data privacy. For an arbitrarily small constant ε > 0, problem (1) achieves global model convergence when

$$F(w^t) - F(w^*) \le \varepsilon, \quad \forall t \ge K_g, \qquad (6)$$

where w^* is the optimal solution to (1).

Next, we provide the convergence analysis of FEDL. We see that J_n^t(w) is also β-strongly convex and L-smooth, like F_n(·), because they have the same Hessian matrix. With these properties of J_n^t(w), we can use GD to solve (2) as follows:

$$z_{k+1} = z_k - h_k \nabla J_n^t(z_k), \qquad (7)$$

where z_k is the local model update and h_k is a predefined learning rate at iteration k; this has been shown to generate a convergent sequence (z_k)_{k≥0} with a linear convergence rate [27]:

$$J_n^t(z_k) - J_n^t(z^*) \le c (1-\gamma)^k \big(J_n^t(z_0) - J_n^t(z^*)\big), \qquad (8)$$

where z^* is the optimal solution to the local problem (2), and c and γ ∈ (0, 1) are constants depending on ρ.

Lemma 1. With Assumption 1 and the assumed linear convergence rate (8) with z_0 = w^{t-1}, the number of local rounds K_l for solving (2) to achieve the θ-approximation condition (3) is

$$K_l = \frac{2}{\gamma} \log \frac{C}{\theta}, \qquad (9)$$

where $C := \sqrt{c\rho}$.
Theorem 1. With Assumption 1, the convergence of FEDL is achieved at a linear rate

$$F(w^t) - F(w^*) \le (1 - \Theta)^t \big(F(w^0) - F(w^*)\big), \qquad (10)$$

where Θ ∈ (0, 1) is defined as

$$\Theta := \frac{2\eta\big(2(\theta-1)^2 - (\theta+1)\theta(3\eta+2)\rho^2 - (\theta+1)\eta\rho^2\big)}{\rho\big(2 - (1-\theta)^2\eta^2\rho^2\big)}. \qquad (11)$$

Corollary 1. The number of global rounds for FEDL to achieve the convergence criterion (6) is

$$K_g = \frac{1}{\Theta} \log \frac{F(w^0) - F(w^*)}{\varepsilon}. \qquad (12)$$

The proof of this corollary follows similarly to that of Lemma 1. We make the following remarks:

1) The convergence of FEDL can always be obtained by setting sufficiently small values of both η and θ ∈ (0, 1) such that Θ ∈ (0, 1). To see why, note that in the denominator of (11) we can always choose small θ and η to make it positive, i.e., $2 - (1-\theta)^2\eta^2\rho^2 > 0$. The numerator of (11) can be rewritten as 2η(A − B), where $A = 2(\theta-1)^2 - (\theta+1)\theta(3\eta+2)\rho^2$ and $B = (\theta+1)\eta\rho^2$. Since $\lim_{\theta \to 0} A = 2$ and $\lim_{\theta,\eta \to 0} B = 0$, there exist small values of θ and η such that A − B > 0, thus Θ > 0. On the other hand, $\lim_{\eta \to 0} \Theta = 0$; thus, there exists a small value of η such that Θ < 1.

2) There is a convergence trade-off between the numbers of local and global rounds characterized by θ: a small θ makes K_l large yet K_g small, according to (9) and (12), respectively. This trade-off was also observed by the authors of [8], though their technique (i.e., primal-dual optimization) differs from ours.

3) While θ affects both local and global convergence, η affects only the global convergence rate of FEDL. If η is small, then Θ is also small, inducing a large K_g. However, if η is too large, Θ may fall outside (0, 1), leading to the divergence of FEDL. We call η the hyper-learning rate for the global problem (1).

4) The condition number ρ also affects the convergence of FEDL: if ρ is large (i.e., problem (2) is poorly conditioned), both η and θ should be sufficiently small in order for Θ ∈ (0, 1) (i.e., a slow convergence rate). This observation is well aligned with traditional optimization convergence analysis [28, Chapter 9].

The time complexity of FEDL is represented by K_g communication rounds, and its computation complexity is K_g K_l computation rounds. Even though the convergence of FEDL is independent of the number of UEs (cf. (12)), when implementing FEDL over wireless networks, the wall clock time of each communication round can become significantly larger than that of computation if the number of UEs increases, due to multi-user contention for the wireless medium. In the next section, we study UE resource allocation to enable FEDL over wireless networks.
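The remarks above are easy to check numerically. The helper functions below evaluate K_l from (9), Θ from (11), and K_g from (12); the Θ expression transcribes the formula as printed above, and gap0 (the initial optimality gap F(w^0) − F(w^*)) is a hypothetical input for illustration.

```python
import numpy as np

def K_local(theta, gamma, c, rho):
    # Number of local rounds (9): K_l = (2/gamma) * log(C/theta), C = sqrt(c*rho)
    return (2.0 / gamma) * np.log(np.sqrt(c * rho) / theta)

def Theta(theta, eta, rho):
    # Global contraction factor (11); FEDL converges linearly iff 0 < Theta < 1
    num = 2 * eta * (2 * (theta - 1) ** 2
                     - (theta + 1) * theta * (3 * eta + 2) * rho ** 2
                     - (theta + 1) * eta * rho ** 2)
    den = rho * (2 - (1 - theta) ** 2 * eta ** 2 * rho ** 2)
    return num / den

def K_global(theta, eta, rho, gap0, eps):
    # Number of global rounds (12) to reach accuracy eps from an initial gap gap0
    return np.log(gap0 / eps) / Theta(theta, eta, rho)
```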
V. FEDL OVER WIRELESS NETWORKS
In this section, we first present the system model and problem formulation for UE resource allocation supporting FEDL over a time-sharing wireless environment. We then decompose this problem into three sub-problems, derive their closed-form solutions, reveal the insights behind them, and provide numerical support.
A. System Model
At first, we consider synchronous communication, which requires all UEs to finish solving their local problems before entering the communication phase. During the communication phase, the model updates are transferred to the edge server using a wireless medium sharing scheme. Each global round thus consists of computation time and communication time, where the latter includes uplink and downlink components. In this work, however, we do not consider the downlink communication time, as it is negligible compared to the uplink one: the downlink has larger bandwidth than the uplink, and the edge server's transmit power is much higher than a UE's transmission power. Besides, the computation time depends only on the number of local rounds, and thus on θ, according to (9). Denoting by T_cp the time of one local round, i.e., the time to run one local round in (8), the computation time in one global round is K_l T_cp. Denoting the communication time in one global round by T_co, the wall clock time of one global round of FEDL is defined as

$$T_g := T_{co} + K_l T_{cp}.$$
1) Computation Model:
We denote by c_n the number of CPU cycles for UE n to execute one sample of data, which can be measured offline [29] and is known a priori. Since all samples $\{x_i, y_i\}_{i \in \mathcal{D}_n}$ have the same size (i.e., number of bits), the number of CPU cycles required for UE n to run one local round is c_n D_n. Denote the CPU-cycle frequency of UE n by f_n. Then the CPU energy consumption of UE n for one local round of computation can be expressed as follows [30]:

$$E_{n,cp} = \sum_{i=1}^{c_n D_n} \frac{\alpha_n}{2} f_n^2 = \frac{\alpha_n}{2} c_n D_n f_n^2, \qquad (13)$$

where α_n/2 is the effective capacitance coefficient of UE n's computing chipset. Furthermore, the computation time per local round of UE n is c_n D_n / f_n, ∀n. We denote the vector of f_n by $f \in \mathbb{R}^N$.
2) Communication Model: In FEDL, regarding the communication phase of the UEs, we consider a time-sharing multi-access protocol (similar to TDMA). We note that this time-sharing model is not restrictive, because other schemes, such as OFDMA, can also be applied to FEDL. The achievable transmission rate (nats/s) of UE n is defined as

$$r_n = B \ln\Big(1 + \frac{h_n p_n}{N_0}\Big), \qquad (14)$$

where B is the bandwidth, N_0 is the background noise power, p_n is the transmission power, and h_n is the average channel gain of UE n during the training time of FEDL. Denote by τ_n the fraction of communication time allocated to UE n, and by s_n the data size (in nats) of w_n and ∇F_n(w_n). Because the dimension of the vectors w_n and ∇F_n(w_n) is fixed, we assume that their sizes are constant throughout the FEDL learning. Then the transmission rate of each UE n is

$$r_n = s_n / \tau_n, \qquad (15)$$

which is shown to be the most energy-efficient transmission policy [32]. Thus, to transmit s_n within a time duration τ_n, UE n's energy consumption is

$$E_{n,co} = \tau_n\, p_n(s_n/\tau_n), \qquad (16)$$

where the power function is

$$p_n(s_n/\tau_n) := \frac{N_0}{h_n}\Big(e^{\frac{s_n}{\tau_n B}} - 1\Big) \qquad (17)$$

according to (14) and (15). We denote the vector of τ_n by $\tau \in \mathbb{R}^N$. Define the total energy consumption of all UEs in each global round by E_g, which is expressed as

$$E_g := \sum\nolimits_{n=1}^{N} \big(E_{n,co} + K_l E_{n,cp}\big).$$
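The per-round cost models in (13) and (16)-(17) are straightforward to assemble numerically. The sketch below evaluates them per UE; all arrays are per-UE vectors and the names are illustrative.

```python
import numpy as np

def computation_cost(alpha, c, D, f):
    # Energy (13) and time of one local round: E = (alpha/2) c D f^2, T = c D / f
    E_cp = 0.5 * alpha * c * D * f ** 2
    T_cp = c * D / f
    return E_cp, T_cp

def communication_cost(s, tau, h, B, N0):
    # Power (17) needed to send s nats in tau seconds at rate s/tau, from (14)-(15),
    # and the resulting uplink energy (16): E = tau * p(s/tau)
    p = (N0 / h) * (np.exp(s / (tau * B)) - 1.0)
    return tau * p, p

def global_round_cost(E_cp, T_cp_max, E_co, T_co, K_l):
    # One global round: E_g = sum_n (E_co_n + K_l * E_cp_n), T_g = T_co + K_l * T_cp
    return np.sum(E_co + K_l * E_cp), T_co + K_l * T_cp_max
```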
B. Problem formulation

We consider an optimization problem, which, abusing the same name, we also call FEDL:

$$\underset{f,\,\tau,\,\theta,\,\eta,\,T_{co},\,T_{cp}}{\text{minimize}} \quad K_g\big(E_g + \kappa T_g\big)$$
$$\text{subject to} \quad \sum\nolimits_{n=1}^{N} \tau_n \le T_{co}, \qquad (18)$$
$$\max_n\, c_n D_n / f_n = T_{cp}, \qquad (19)$$
$$f_n^{min} \le f_n \le f_n^{max}, \;\forall n \in \mathcal{N}, \qquad (20)$$
$$p_n^{min} \le p_n(s_n/\tau_n) \le p_n^{max}, \;\forall n \in \mathcal{N}, \qquad (21)$$
$$0 \le \theta \le 1. \qquad (22)$$

Minimizing the UEs' energy consumption and minimizing the FL time are conflicting goals. For example, the UEs can save energy by always running at the lowest frequency level, but this will certainly increase the training time. Therefore, to strike a balance between energy cost and training time, the weight κ (Joules/second), used in the objective as the amount of additional energy cost that FEDL is willing to bear for one unit of reduced training time, captures the Pareto-optimal trade-off between the UEs' energy cost and the FL time. For example, when most of the UEs are plugged in, UE energy is not a main concern, and κ can be large. According to optimization theory, 1/κ also plays the role of a Lagrange multiplier for a "hard constraint" on UE energy [28]. (We treat the case of random h_n by adding an outage probability constraint; e.g., for a Rayleigh fading channel, $\Pr\big(h_n p_n / N_0 < \gamma_0\big) \le \zeta$, where γ_0 is the SNR threshold and ζ bounds the outage probability. This constraint is equivalent to $p_n \ge \frac{-\gamma_0 N_0}{\bar{h}_n \log(1-\zeta)}$ and can be integrated into constraint (21) without changing any insights of the considered problem.)

While constraint (18) captures the time-sharing uplink transmission of the UEs, constraint (19) states that the computing time of one local round is determined by the "bottleneck" UE (e.g., the one with large data size and low CPU frequency). The feasible regions of the UEs' CPU frequencies and transmit powers are imposed by constraints (20) and (21), respectively. We note that (20) and (21) also capture the heterogeneity of UEs with different types of CPU and transmit chipsets. The last constraint restricts the feasible range of the local accuracy.
FEDL
We see that
FEDL is non-convex due to the constraint (19)and several products of two functions in the objective function.However, in this section we will characterize
FEDL ’s optimalsolution by decomposing it into multiple convex sub-problems.We consider the first case when θ and η are fixed, then FEDL can be decomposed into two sub-problems as follows:
SUB1 : minimize f,T cp (cid:88) Nn =1 E n,cp + κT cp subject to c n D n f n ≤ T cp , ∀ n ∈ N , (23) f minn ≤ f n ≤ f maxn , ∀ n ∈ N . SUB2 : min. τ,T co (cid:88) Nn =1 E n,co + κT co s.t. (cid:88) Nn =1 τ n ≤ T co , (24) p minn ≤ p n ( s n /τ n ) ≤ p maxn , ∀ n. (25)While SUB1 is a CPU-cycle control problem for thecomputation time and energy minimization,
SUB2 can beconsidered as an uplink power control to determine the UEs’fraction of time sharing to minimize the UEs energy andcommunication time. We note that the constraint (19) of
FEDL is replaced by an equivalent one (23) in
SUB1 . We canconsider T cp and T co as virtual deadlines for UEs to performtheir computation and communication updates, respectively.It can be observed that both SUB1 and
SUB2 are convexproblems. SUB1
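Since SUB1 is convex (the constraint c_n D_n / f_n ≤ T_cp is convex in (f, T_cp) for f_n > 0), its solution can be cross-checked with a generic solver. The sketch below assumes the cvxpy package is available; it is a numerical sanity check under illustrative inputs, not the closed-form solution derived next.

```python
import numpy as np
import cvxpy as cp

def solve_sub1_numerically(alpha, c, D, f_min, f_max, kappa):
    # SUB1: minimize sum_n (alpha_n/2) c_n D_n f_n^2 + kappa * T_cp
    f = cp.Variable(len(c))
    T = cp.Variable()
    energy = cp.sum(cp.multiply(0.5 * alpha * c * D, cp.square(f)))
    constraints = [
        cp.multiply(c * D, cp.inv_pos(f)) <= T,   # c_n D_n / f_n <= T_cp, i.e. (23)
        f >= f_min, f <= f_max,                   # frequency bounds (20)
    ]
    cp.Problem(cp.Minimize(energy + kappa * T), constraints).solve()
    return f.value, T.value
```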
SUB1 Solution: We first propose Algorithm 2 in order to categorize the UEs into one of three groups: N1 is the group of "bottleneck" UEs that always run at their maximum frequency; N2 is the group of "strong" UEs that can finish their tasks before the computational virtual deadline even at their minimum frequency; and N3 is the group of UEs whose optimal frequency lies strictly inside their feasible sets.

Lemma 2. The optimal solution to SUB1 is as follows:

$$f_n^* = \begin{cases} f_n^{max}, & \forall n \in \mathcal{N}_1, \\ f_n^{min}, & \forall n \in \mathcal{N}_2, \\ c_n D_n / T_{cp}^*, & \forall n \in \mathcal{N}_3, \end{cases} \qquad (26)$$

$$T_{cp}^* = \max\big\{T_{\mathcal{N}_1}, T_{\mathcal{N}_2}, T_{\mathcal{N}_3}\big\}, \qquad (27)$$

where N1, N2, N3 ⊆ N are the three subsets of UEs produced by Algorithm 2 and

$$T_{\mathcal{N}_1} = \max_{n \in \mathcal{N}_1} \frac{c_n D_n}{f_n^{max}}, \quad T_{\mathcal{N}_2} = \max_{n \in \mathcal{N}_2} \frac{c_n D_n}{f_n^{min}}, \quad T_{\mathcal{N}_3} = \bigg(\frac{\sum_{n \in \mathcal{N}_3} \alpha_n (c_n D_n)^3}{\kappa}\bigg)^{1/3}. \qquad (28)$$

Algorithm 2 Finding N1, N2, N3 in Lemma 2
  Sort UEs such that c_1 D_1 / f_1^{min} ≤ c_2 D_2 / f_2^{min} ≤ ... ≤ c_N D_N / f_N^{min}
  Input: N1 = ∅, N2 = ∅, N3 = N, T_{N3} in (28)
  for i = 1 to N do
    if max_{n ∈ N3} c_n D_n / f_n^{max} ≥ T_{N3} > 0 and N1 == ∅ then
      N1 = N1 ∪ {m : c_m D_m / f_m^{max} = max_{n ∈ N3} c_n D_n / f_n^{max}}
      N3 = N3 \ N1 and update T_{N3} in (28)
    if c_i D_i / f_i^{min} ≤ T_{N3} then
      N2 = N2 ∪ {i}
      N3 = N3 \ {i} and update T_{N3} in (28)

From Lemma 2, we first see that the optimal solution depends not only on the existence of these subsets, but also on their virtual deadlines T_{N1}, T_{N2}, and T_{N3}, the longest of which determines the optimal virtual deadline T_cp^*. Second, from (26), the optimal frequency of each UE depends on both T_cp^* and the subset the UE belongs to. We note that, depending on κ, some of the three sets (but not all) may be empty, and by default T_{Ni} = 0 if N_i is empty, i = 1, 2, 3. Next, by varying κ, we observe the following special cases.
Corollary 2. The optimal solution to SUB1 can be divided into four regions as follows.

a) $\kappa \le \min_{n \in \mathcal{N}} \alpha_n (f_n^{min})^3$: N1 and N3 are empty sets. Thus, N2 = N, $T_{cp}^* = T_{\mathcal{N}_2} = \max_{n \in \mathcal{N}} c_n D_n / f_n^{min}$, and $f_n^* = f_n^{min}$, ∀n ∈ N.

b) $\min_{n \in \mathcal{N}} \alpha_n (f_n^{min})^3 < \kappa \le \frac{\sum_{n \in \mathcal{N}} \alpha_n (c_n D_n)^3}{(\max_{n \in \mathcal{N}} c_n D_n / f_n^{min})^3}$: N2 and N3 are non-empty sets, whereas N1 is empty. Thus, $T_{cp}^* = \max\{T_{\mathcal{N}_2}, T_{\mathcal{N}_3}\}$, and $f_n^* = \max\{c_n D_n / T_{cp}^*, f_n^{min}\}$, ∀n ∈ N.

c) $\frac{\sum_{n \in \mathcal{N}} \alpha_n (c_n D_n)^3}{(\max_{n \in \mathcal{N}} c_n D_n / f_n^{min})^3} < \kappa \le \frac{\sum_{n \in \mathcal{N}} \alpha_n (c_n D_n)^3}{(\max_{n \in \mathcal{N}} c_n D_n / f_n^{max})^3}$: N1 and N2 are empty sets. Thus N3 = N, $T_{cp}^* = T_{\mathcal{N}_3}$, and $f_n^* = c_n D_n / T_{\mathcal{N}_3}$, ∀n ∈ N.

d) $\kappa > \frac{\sum_{n \in \mathcal{N}} \alpha_n (c_n D_n)^3}{(\max_{n \in \mathcal{N}} c_n D_n / f_n^{max})^3}$: N1 is non-empty. Thus $T_{cp}^* = T_{\mathcal{N}_1}$, and

$$f_n^* = \begin{cases} f_n^{max}, & \forall n \in \mathcal{N}_1, \\ \max\{c_n D_n / T_{\mathcal{N}_1}, f_n^{min}\}, & \forall n \in \mathcal{N} \setminus \mathcal{N}_1. \end{cases} \qquad (29)$$

We illustrate Corollary 2 in Fig. 1 with four regions as follows. All closed-form solutions are also verified by the solver IPOPT [33].

[Fig. 1: Solution to SUB1 with five UEs: (a) optimal CPU frequency of each UE; (b) the three subsets output by Alg. 2; (c) optimal computation time. Wireless setting: exponentially distributed UE channel gains with path loss referenced to d_0 = 1 m, B = 1 MHz, and box-constrained UE transmission powers. Computation setting: the per-UE training sizes D_n, cycles-per-bit c_n, and CPU frequency ranges are drawn uniformly at random, and the UE update size s_n is the same for all UEs.]

a) Very low κ: designed solely for energy minimization. In this region, all UEs run their CPUs at the lowest cycle frequency f_n^{min}, so T_cp^* is determined by the last UEs to finish their computation at their minimum frequency.

b) Low κ: designed for prioritized energy minimization. This region contains UEs of both N2 and N3. T_cp^* is governed by whichever subset has the higher virtual computation deadline, which also determines the optimal CPU-cycle frequency of N3. Other UEs with lightly loaded data, if they exist, can run in the most energy-saving mode f_n^{min} and still finish their task before T_cp^* (i.e., N2).

c) Medium κ: designed to balance computation time and energy minimization. All UEs belong to N3, with their optimal CPU-cycle frequencies strictly inside the feasible set.

d) High κ: designed for prioritized computation time minimization. A high value of κ ensures the existence of N1, consisting of the most "bottleneck" UEs (i.e., heavily loaded data and/or low f_n^{max}), which run at their maximum CPU-cycle frequency in (29) (top) and thus determine the optimal computation time T_cp^*. The other, "non-bottleneck" UEs either (i) adjust a "right" CPU-cycle frequency to save energy while keeping their computing time equal to T_cp^* (i.e., N3), or (ii) finish their computation at minimum frequency before the "bottleneck" UEs (i.e., N2), as in (29) (bottom).
SUB2 Solution: Before characterizing the solution to SUB2, from (17) and (25) we first define two bounded values for τ_n:

$$\tau_n^{max} = \frac{s_n}{B \ln\big(1 + h_n N_0^{-1} p_n^{min}\big)}, \qquad \tau_n^{min} = \frac{s_n}{B \ln\big(1 + h_n N_0^{-1} p_n^{max}\big)},$$

which are the maximum and minimum possible fractions of T_co that UE n can achieve by transmitting with its minimum and maximum power, respectively.

[Fig. 2: The solution to SUB2 with five UEs: (a) UEs' optimal transmission power; (b) UEs' optimal transmission time. The numerical setting is the same as that of Fig. 1.]

We also define a new function $g_n : \mathbb{R} \to \mathbb{R}$ as

$$g_n(\kappa) = \frac{s_n / B}{1 + W\big((\kappa h_n N_0^{-1} - 1)/e\big)},$$

where W(·) is the Lambert W-function. We can consider g_n(·) as an indirect "power control" function that helps UE n control the amount of time it should spend transmitting the data of size s_n, by adjusting its power based on the weight κ. This function is strictly decreasing (so its inverse function g_n^{-1}(·) exists), reflecting that when we put more priority on minimizing the communication time (i.e., high κ), UE n should raise its power to finish its transmission in less time (i.e., low τ_n).
Lemma 3. The solution to SUB2 is as follows:
a) If $\kappa \le g_n^{-1}(\tau_n^{max})$, then $\tau_n^* = \tau_n^{max}$.
b) If $g_n^{-1}(\tau_n^{max}) < \kappa < g_n^{-1}(\tau_n^{min})$, then $\tau_n^{min} < \tau_n^* = g_n(\kappa) < \tau_n^{max}$.
c) If $\kappa \ge g_n^{-1}(\tau_n^{min})$, then $\tau_n^* = \tau_n^{min}$,
and $T_{co}^* = \sum_{n=1}^{N} \tau_n^*$.

[TABLE I: The solution to SUB3 with five UEs (columns: ρ, κ, θ*, Θ, η*). The numerical setting is the same as that of Fig. 1.]

This lemma can be explained through the lens of network economics. If we interpret the FEDL system as a buyer and the UEs as sellers, with UE power as the commodity, then the inverse function g_n^{-1}(·) is interpreted as the price of energy that UE n is willing to accept to provide power service for FEDL to reduce the training time. This function has two properties: (i) the price increases with respect to UE power, and (ii) the price sensitivity depends on the UE's characteristics, e.g., UEs with better channel quality can have a lower price, whereas UEs with a larger data size s_n have a higher price. Thus, each UE n compares its energy price g_n^{-1}(·) with the "offer" price κ from the system to decide how much power it is willing to "sell". The three cases of Lemma 3 then correspond to the following:

a) Low offer: if the offer price κ is lower than the minimum price request g_n^{-1}(τ_n^{max}), UE n sells its lowest service by transmitting with the minimum power p_n^{min}.
b) Medium offer: if the offer price κ is within the acceptable price range, UE n finds a power level at which its energy price matches the offer price.
c) High offer: if the offer price κ is higher than the maximum price request g_n^{-1}(τ_n^{min}), UE n sells its highest service by transmitting with the maximum power p_n^{max}.

Lemma 3 is further illustrated in Fig. 2, which shows how the solution to SUB2 varies with respect to κ. It is observed from this figure that, due to the UE heterogeneity of channel gains, the same medium value of κ is a medium offer to UEs 2, 3, and 4, but a high offer to UE 1 and a low offer to UE 5.
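Lemma 3 yields a one-line per-UE rule: evaluate g_n(κ) and clip it to [τ_n^{min}, τ_n^{max}], which reproduces cases a)-c) because g_n is strictly decreasing. A sketch using scipy's Lambert W (real branch); all inputs are per-UE arrays:

```python
import numpy as np
from scipy.special import lambertw

def solve_sub2_closed_form(s, h, B, N0, p_min, p_max, kappa):
    # Feasible transmission-time bounds induced by the power limits (21), via (17)
    tau_max = s / (B * np.log1p(h * p_min / N0))   # slowest allowed transmission
    tau_min = s / (B * np.log1p(h * p_max / N0))   # fastest allowed transmission
    # "Power control" function g_n(kappa); its argument is always >= -1/e
    g = (s / B) / (1.0 + np.real(lambertw((kappa * h / N0 - 1.0) / np.e)))
    tau = np.clip(g, tau_min, tau_max)             # cases a)-c) of Lemma 3
    return tau, tau.sum()                          # tau_n* and T_co* = sum_n tau_n*
```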
While the SUB1 and SUB2 solutions share the same threshold-based dependence on κ, we observe the following differences. In the SUB1 solution, the optimal CPU-cycle frequency of UE n depends on the optimal T_cp^*, which in turn depends on the loads (i.e., c_n D_n, ∀n) of all UEs; thus, all UE load information is required for the computation phase. In the SUB2 solution, on the other hand, each UE n can independently choose its optimal power by comparing its price function g_n^{-1}(·) with κ, so collecting UE information is not needed. The reason is that the synchronization of computation time in constraint (23) of SUB1 couples all UE loads, whereas the UEs' time-sharing constraint (24) of SUB2 can be decoupled by comparing with the fixed "offer" price κ.

SUB3 Solution: We observe that the solutions to SUB1 and SUB2 have no dependence on θ, so the optimal T_co^*, T_cp^*, f^*, τ^*, and thus the corresponding optimal energy values, denoted by E_{n,cp}^* and E_{n,co}^*, can be determined based on κ according to Lemmas 2 and 3. These solutions, however, affect the third sub-problem of FEDL, as will be shown in what follows.

[Fig. 3: Effect of η on the convergence of FEDL (logistic regression with ρ ∈ {1.5, 2, 5} and η ∈ {0.1, 0.3, 0.4, 0.7}). Training processes use full-batch gradient descent and K_l = 20.]
$$\textbf{SUB3:} \quad \underset{\theta,\,\eta > 0}{\text{minimize}} \;\; K_g \Big(\sum\nolimits_{n=1}^{N} \big(E_{n,co}^* + K_l E_{n,cp}^*\big) + \kappa\big(T_{co}^* + K_l T_{cp}^*\big)\Big)$$
$$\text{subject to} \;\; 0 < \theta < 1, \quad 0 < \Theta < 1.$$

SUB3 is unfortunately non-convex. However, since there are only two variables to optimize, we can employ numerical methods to find the optimal solution. The numerical results in Table I show that the solutions θ* and η* to SUB3 decrease when ρ increases, which makes Θ decrease, as explained by the results of Theorem 1. We also observe that κ has more effect on the solution to SUB3 when ρ is small.

FEDL Solution: Since we can obtain the stationary points of SUB3 using successive convex approximation techniques such as NOVA [34], we have:

Theorem 2. The combined solutions to the three sub-problems SUB1, SUB2, and SUB3 are stationary points of FEDL.

The proof of this theorem is straightforward. The idea is to use the KKT conditions to find the stationary points of FEDL. The KKT conditions can then be decomposed into three independent groups of equations (i.e., with no coupling variables between them), in which the first two groups match exactly the KKT conditions of SUB1 and SUB2, which are solved by the closed-form solutions in Lemmas 2 and 3, and the last group, for SUB3, is solved by numerical methods.

We now offer some discussion of the combined solution to FEDL. First, we see that the SUB1 and SUB2 solutions can be characterized independently, which can be explained by the fact that each UE often has two separate processors: one CPU for mobile applications and another baseband processor for the radio control function. Second, neither SUB1 nor SUB2 depends on θ, because the communication phase in SUB2 is clearly not affected by the local accuracy of the computing problem, whereas SUB1 considers the computation cost of only one local round. However, the solutions to SUB1 and SUB2, which reveal how much more expensive communication is than computation, are the decisive factors determining the optimal level of local accuracy. Therefore, we can sequentially solve SUB1 and SUB2 first, and then SUB3, to achieve the optimal solutions to FEDL.
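Because SUB3 has only two scalar variables, a coarse grid search is often adequate in practice. The sketch below reuses the K_local and Theta helpers from the earlier convergence sketch; gap0 and eps, which set K_g through (12), are hypothetical inputs.

```python
import numpy as np

def solve_sub3_grid(E_cp, E_co, T_cp, T_co, gamma, c, rho, kappa,
                    gap0=10.0, eps=1e-3):
    # Minimize K_g * (sum_n (E_co_n + K_l * E_cp_n) + kappa * (T_co + K_l * T_cp))
    best, best_cost = None, np.inf
    for theta in np.linspace(0.01, 0.99, 99):
        for eta in np.linspace(0.01, 2.0, 200):
            th = Theta(theta, eta, rho)
            if not 0.0 < th < 1.0:                 # feasibility: 0 < Theta < 1
                continue
            K_l = K_local(theta, gamma, c, rho)    # local rounds (9)
            K_g = np.log(gap0 / eps) / th          # global rounds (12)
            cost = K_g * (np.sum(E_co) + K_l * np.sum(E_cp)
                          + kappa * (T_co + K_l * T_cp))
            if cost < best_cost:
                best, best_cost = (theta, eta), cost
    return best, best_cost
```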
VI. EXPERIMENTS
This section validates FEDL's learning performance in a heterogeneous network. The TensorFlow experimental results show that FEDL gains a performance improvement over the vanilla FedAvg [4] in terms of training loss, convergence rate, and test accuracy in various settings. All code and data are published on GitHub [35].

[Fig. 4: Effect of different batch sizes, with fixed K_l = 20 (B = ∞ means full batch size), on FEDL's performance: testing accuracy and training loss versus K_g on MNIST, for FEDL and FedAvg with B ∈ {20, 50, ∞} and η ∈ {0.2, 0.7, 1}.]

[Fig. 5: Effect of increasing local computation time on the convergence of FEDL: testing accuracy and training loss versus K_g on FEMNIST, for FEDL (B = 20 and B = ∞) and FedAvg (B = 20) with K_l ∈ {10, 20, 40}.]
Experimental settings:
In our setting, the performance of FEDL is examined on classification tasks, including logistic and multinomial logistic regression with the cross-entropy loss function (convex models), over different federated datasets ("MNIST" [36], "FEMNIST" [37], and "Synthetic"). All datasets are split randomly, with 75% for training and 25% for testing. In the experiments, we consider 100 UEs concurrently taking part in the training process. In order to generate datasets capturing the heterogeneous nature of FL, both the MNIST and Synthetic datasets have different per-UE sample sizes based on the power law in [9]. In the Synthetic dataset, each UE is generated with 2 labels (making the corresponding task logistic regression), while in MNIST each UE holds two of the total of ten labels. The FEMNIST dataset for each UE is built by partitioning the data in Extended MNIST [38] (62 labels) based on the writer of the digit or character, where every UE is considered a writer. The minimum numbers of data samples per UE are 55, 127, and 50 for MNIST, FEMNIST, and Synthetic, respectively. All parameters used in the experiments are summarized in Table II.

TABLE II: Experiment parameters.
  Parameter | Description
  ρ   | Condition number of F_n(·)'s Hessian matrix, ρ = L/β
  K_g | Global rounds for global model updates
  K_l | Local rounds for local model updates
  η   | Hyper-learning rate
  h_k | Local learning rate
  B   | Batch size (B = ∞ means full batch size, i.e., GD)
Effect of the hyper-learning rate on FEDL's convergence: We first verify the theoretical findings by predetermining the value of ρ and observing the impact of changing η on the convergence of FEDL using a Synthetic dataset. We conduct a logistic regression experiment in which each UE uses a local binary cross-entropy loss function with an l2-regularization term λ. Using a data generation method similar to that in [39], in which users' samples and weights are drawn from zero-mean Gaussian distributions and 5% of the labels are flipped uniformly at random, we can control the value of ρ by changing λ. Further, to allow for a heterogeneous setting, each user's data is generated with a different mean, itself drawn from a zero-mean Gaussian. We examine three values of ρ in Fig. 3. For every value of ρ, there exist sufficiently small values of η that allow FEDL to converge.
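A sketch of the synthetic-data recipe just described, with hypothetical dimensions and unit variances where the exact values are not specified above:

```python
import numpy as np

def make_synthetic_user(d, n_samples, flip=0.05, rng=None):
    rng = rng or np.random.default_rng()
    mean = rng.normal()                          # per-user mean, itself Gaussian (heterogeneity)
    X = rng.normal(mean, 1.0, size=(n_samples, d))
    w_true = rng.normal(size=d)                  # ground-truth weights, Gaussian
    y = np.where(X @ w_true >= 0, 1.0, -1.0)     # binary labels in {-1, +1}
    flips = rng.random(n_samples) < flip         # flip 5% of labels uniformly at random
    y[flips] *= -1.0
    return X, y
```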
Effect of different gradient descent algorithms on FEDL's performance: As UEs are allowed to use different gradient descent methods to minimize the local problem (2), the convergence of FEDL can be evaluated under different optimization algorithms, GD and mini-batch SGD, by changing the batch size used during local training. Although our analysis results are based on GD, we also monitor the behavior of FEDL using SGD in the experiments for comparison. While the full batch size is applied for GD, mini-batch SGD is trained with batch sizes of 20 and 50 samples. All algorithms use the same local learning rate h_k during the experiments. Fig. 4 demonstrates that FEDL outperforms FedAvg for all batch-size settings (the improvements in testing accuracy and training loss are approximately 1.5% and 21%, respectively, for batch size 20; 1% and 10% for batch size 50; and 0.2% and 14% for the full batch size). Note that increasing the batch size lowers the convergence speed of both FEDL and FedAvg; however, increasing the value of η speeds up the convergence of FEDL in the GD case. Even if using a larger batch size benefits the training time of one local update, it leads to an increase in the value of K_l or η needed to guarantee FEDL's convergence.
Effect of increasing local computation on convergence time: In order to validate the performance of FEDL for different values of the local updates K_l (Fig. 5), we use both mini-batch SGD with a fixed batch size of 20 and GD for the local updates, and increase K_l from 10 to 40. When K_l is small, FEDL using mini-batch SGD achieves a significant performance gap over both FedAvg and FEDL using GD in terms of training loss and testing accuracy. However, the performance gap between FEDL using SGD and FEDL using GD narrows with larger K_l, and the rise of K_l has an appreciably positive impact on the convergence time of both FEDL and FedAvg. It is also remarkable that when K_l reaches a sufficiently large value, FEDL fluctuates slightly.

[Fig. 6: Impact of UE heterogeneity on SUB1 and SUB2 with fixed κ: (a) impact of L_cp on f_n^*; (b) impact of L_cp on T_cp^*; (c) impact of L_co on τ_n^*; (d) impact of L_co on T_co^*.]

[Fig. 7: Impact of UE heterogeneity on FEDL's objective: (a) impact of L_cp ∈ {0.15, 0.75, 150.0}; (b) impact of L_co ∈ {0.31, 32.85, 186.8}.]

[Fig. 8: Pareto-optimal points of FEDL (time cost versus energy cost for κ ∈ {0.1, 1.0, 10.0}): (a) impact of L_cp and κ; (b) impact of L_co and κ.]

VII. NUMERICAL RESULTS
In this section, both the communication and computation models follow the same setting as in Fig. 1, except that the number of UEs is increased to 50 and all UEs have the same f_n^{max} = 2.0 GHz and c_n = 20 cycles/bit. Furthermore, we define two new parameters, addressing the UE heterogeneity of the computation and communication phases in FEDL, respectively:

$$L_{cp} = \frac{\max_{n \in \mathcal{N}} c_n D_n / f_n^{max}}{\min_{n \in \mathcal{N}} c_n D_n / f_n^{min}} \qquad \text{and} \qquad L_{co} = \frac{\max_{n \in \mathcal{N}} \tau_n^{min}}{\min_{n \in \mathcal{N}} \tau_n^{max}}.$$

Higher values of L_cp and L_co indicate higher levels of UE heterogeneity. For example, L_cp = 1 (L_co = 1) can be considered a high heterogeneity level due to unbalanced data distribution and/or UE configuration (unbalanced channel gain distribution), such that UEs at their minimum frequency (maximum transmission power) still have the same computation (communication) time as those at maximum frequency (minimum transmission power). The level of heterogeneity is controlled by two different settings. To vary L_cp, the training sizes D_n are generated with different ratios D_min/D_max while the average UE data size is kept at the same value for all values of L_cp. On the other hand, to vary L_co, the distances between the devices and the edge server are generated with different ratios d_min/d_max while the average distance over all UEs is maintained for different values of L_co. Here D_min and D_max (d_min and d_max) are the minimum and maximum data sizes (edge server-to-UE distances), respectively. In all scenarios, we fix L_cp when varying L_co, and fix L_co when varying L_cp.
1) Impact of UE heterogeneity: We first examine the impact of UE heterogeneity on SUB1 and SUB2 in Fig. 6, which shows that increasing L_cp and L_co forces the optimal f_n^* and τ_n^* to take more diverse values, and thus increases the computation and communication times T_cp^* and T_co^*, respectively. As expected, we observe that a high level of UE heterogeneity has a negative impact on the FEDL system, as illustrated in Figs. 7a and 7b: the total cost (the objective of FEDL) increases with higher values of L_cp and L_co, respectively. However, in this setting, when T_cp is comparable to T_co (e.g., 6.2 versus 2.9 seconds at L_cp = L_co = 10), the impacts of L_cp and L_co on the total cost are comparable: at κ = (0.1, 1.0, 10.0), the total cost of FEDL increases (1.1, 1.05, 1.05) times and (1.15, 1.07, 1.08) times when L_cp is increased from 0.15 to 150 and L_co from 0.31 to 186.8, respectively.
2) Pareto optimal trade-off: We next illustrate the Pareto curve in Fig. 8. This curve shows the trade-off between the conflicting goals of minimizing the time cost K_g T_g and the energy cost K_g E_g, in which we can decrease one type of cost only at the expense of increasing the other. This figure also shows that the Pareto curve of FEDL is more efficient when the system has a low level of UE heterogeneity (i.e., small L_cp and/or L_co).

VIII. CONCLUSIONS
In this paper, we studied FL, a learning scheme in which the training model is distributed to participating UEs, which perform training over their local data. Although FL shows vital advantages in data privacy, the heterogeneity across users' data and UEs' characteristics remain challenging problems. We proposed an effective algorithm, without the i.i.d. UEs' data assumption, for strongly convex and smooth FL problems, and characterized the algorithm's convergence. For the wireless resource allocation problem, we embedded the proposed FL algorithm in wireless networks, considering the trade-offs not only between computation and communication latencies but also between the FL time and UE energy consumption. Despite the non-convex nature of this problem, we decomposed it into three sub-problems with convex structure before analyzing their closed-form solutions and quantitative insights into problem design. We then verified the theoretical findings of the new algorithm by experiments on TensorFlow with several datasets, and verified the wireless resource allocation sub-problems by extensive numerical results. In addition to validating the theoretical convergence, our experiments also showed that the proposed algorithm can boost the convergence speed compared to an existing baseline approach.

REFERENCES
[1] N. H. Tran, W. Bao, A. Zomaya, M. N. Nguyen, and C. S. Hong, "Federated Learning over Wireless Networks: Optimization Model Design and Analysis," in IEEE INFOCOM 2019.
[4] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Agüera y Arcas, "Communication-Efficient Learning of Deep Networks from Decentralized Data," in Artificial Intelligence and Statistics, Apr. 2017, pp. 1273–1282.
[5] .
[6] J. Konečný, H. B. McMahan, D. Ramage, and P. Richtárik, "Federated Optimization: Distributed Machine Learning for On-Device Intelligence," arXiv:1610.02527 [cs], Oct. 2016.
[7] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive Federated Learning in Resource Constrained Edge Computing Systems," IEEE Journal on Selected Areas in Communications, vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
[8] V. Smith, S. Forte, C. Ma, M. Takáč, M. I. Jordan, and M. Jaggi, "CoCoA: A General Framework for Communication-Efficient Distributed Optimization," Journal of Machine Learning Research, vol. 18, no. 230, pp. 1–49, 2018.
[9] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith, "Federated Optimization for Heterogeneous Networks," in Proceedings of the 1st Adaptive & Multitask Learning ICML Workshop, Long Beach, CA, 2019, p. 16.
[10] C. Ma, J. Konečný, M. Jaggi, V. Smith, M. I. Jordan, P. Richtárik, and M. Takáč, "Distributed optimization with arbitrary local solvers," Optimization Methods and Software, vol. 32, no. 4, pp. 813–848, Jul. 2017.
[11] J. Hamm, A. C. Champion, G. Chen, M. Belkin, and D. Xuan, "Crowd-ML: A Privacy-Preserving Learning Framework for a Crowd of Smart Devices," in IEEE ICDCS, Jun. 2015, pp. 11–20.
[12] O. Shamir, N. Srebro, and T. Zhang, "Communication-efficient Distributed Optimization Using an Approximate Newton-type Method," in ICML, Beijing, China, 2014, pp. II-1000–II-1008.
[13] J. Wang and G. Joshi, "Cooperative SGD: A Unified Framework for the Design and Analysis of Communication-Efficient SGD Algorithms," arXiv:1808.07576 [cs, stat], Aug. 2018.
[14] S. U. Stich, "Local SGD Converges Fast and Communicates Little," arXiv:1805.09767 [cs, math], May 2018.
[15] F. Zhou and G. Cong, "On the convergence properties of a K-step averaging stochastic gradient descent algorithm for nonconvex optimization," arXiv:1708.01012 [cs, stat], Aug. 2017.
[16] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated Learning: Strategies for Improving Communication Efficiency," http://arxiv.org/abs/1610.05492, Oct. 2016.
[17] V. Smith, C.-K. Chiang, M. Sanjabi, and A. Talwalkar, "Federated Multi-task Learning," in NeurIPS'17, Long Beach, California, USA, 2017, pp. 4427–4437.
[18] M. M. Amiri and D. Gündüz, "Over-the-Air Machine Learning at the Wireless Edge," in IEEE SPAWC, Jul. 2019, pp. 1–5.
[19] H. Tang, S. Gan, C. Zhang, T. Zhang, and J. Liu, "Communication Compression for Decentralized Training," in NeurIPS, 2018, pp. 7663–7673.
[20] T. T. Vu, D. T. Ngo, N. H. Tran, H. Q. Ngo, M. N. Dao, and R. H. Middleton, "Cell-Free Massive MIMO for Wireless Federated Learning," arXiv:1909.12567 [cs, eess, math], Sep. 2019.
[21] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, "Scheduling Policies for Federated Learning in Wireless Networks," arXiv:1908.06287 [cs, eess, math], Oct. 2019.
[22] Z. Yang, M. Chen, W. Saad, C. S. Hong, and M. Shikh-Bahaei, "Energy Efficient Federated Learning Over Wireless Communication Networks," arXiv:1911.02417 [cs, math, stat], Nov. 2019.
[23] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright, "ATOMO: Communication-efficient Learning via Atomic Sparsification," in NeurIPS, 2018, pp. 9850–9861.
[24] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang, "ZipML: Training Linear Models with End-to-End Low Precision, and a Little Bit of Deep Learning," in International Conference on Machine Learning, Jul. 2017, pp. 4035–4043.
[25] M. Chen, O. Semiari, W. Saad, X. Liu, and C. Yin, "Federated Echo State Learning for Minimizing Breaks in Presence in Wireless Virtual Reality Networks," arXiv:1812.01202 [cs, math], Sep. 2019.
[26] S. J. Reddi, J. Konečný, P. Richtárik, B. Póczós, and A. Smola, "AIDE: Fast and Communication Efficient Distributed Optimization," arXiv:1608.06879 [cs, math, stat], Aug. 2016.
[27] Y. Nesterov, Lectures on Convex Optimization. Springer International Publishing, 2018, vol. 137.
[28] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, Mar. 2004.
[29] A. P. Miettinen and J. K. Nurminen, "Energy Efficiency of Mobile Clients in Cloud Computing," in USENIX HotCloud'10, Berkeley, CA, USA, 2010, pp. 4–4.
[30] T. D. Burd and R. W. Brodersen, "Processor Design for Portable Systems," Journal of VLSI Signal Processing Systems, vol. 13, no. 2-3, pp. 203–221, Aug. 1996.
[31] S. Kandukuri and S. Boyd, "Optimal power control in interference-limited fading wireless channels with outage-probability specifications," IEEE Transactions on Wireless Communications, vol. 1, no. 1, pp. 46–55, Jan. 2002.
[32] B. Prabhakar, E. U. Biyikoglu, and A. E. Gamal, "Energy-efficient transmission over a wireless link via lazy packet scheduling," in IEEE INFOCOM 2001, vol. 1, 2001, pp. 386–394.
[33] A. Wächter and L. T. Biegler, "On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming," Mathematical Programming, vol. 106, no. 1, pp. 25–57, Mar. 2006.
[34] G. Scutari, F. Facchinei, and L. Lampariello, "Parallel and Distributed Methods for Constrained Nonconvex Optimization—Part I: Theory," IEEE Transactions on Signal Processing, vol. 65, no. 8, pp. 1929–1944, Apr. 2017.
[35] C. Dinh, https://github.com/CharlieDinh/FEDL Over WireLess.
[36] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[37] S. Caldas, P. Wu, T. Li, J. Konečný, H. B. McMahan, V. Smith, and A. Talwalkar, "LEAF: A Benchmark for Federated Settings," arXiv:1812.01097 [cs, stat], Dec. 2018.
[38] G. Cohen, S. Afshar, J. Tapson, and A. van Schaik, "EMNIST: Extending MNIST to handwritten letters," in IEEE IJCNN, May 2017, pp. 2921–2926.
[39] X. Li, W. Yang, S. Wang, and Z. Zhang, "Communication Efficient Decentralized Training with Multiple Local Updates," arXiv:1910.09126 [cs, math, stat], Oct. 2019.

APPENDIX
A. Review of useful existing results
With Assumption 1 on the $L$-smoothness and $\beta$-strong convexity of $F_n(\cdot)$, according to Theorems 2.1.5, 2.1.10, and 2.1.12 of [27], we have the following useful inequalities:
$$2L \big( F_n(w) - F_n(w^*) \big) \ge \|\nabla F_n(w)\|^2, \quad \forall w \quad (30)$$
$$\langle \nabla F_n(w) - \nabla F_n(w'), w - w' \rangle \ge \frac{1}{L} \|\nabla F_n(w) - \nabla F_n(w')\|^2, \quad \forall w, w' \quad (31)$$
$$2\beta \big( F_n(w) - F_n(w^*) \big) \le \|\nabla F_n(w)\|^2, \quad \forall w \quad (32)$$
$$\beta \|w - w^*\| \le \|\nabla F_n(w)\|, \quad \forall w, \quad (33)$$
where $w^*$ is the solution to the problem $\min_{w \in \mathbb{R}^d} F_n(w)$.
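As a quick numerical sanity check (a minimal sketch on a toy quadratic, not part of the analysis), the inequalities (30)–(33) can be verified programmatically; the matrix, test points, and tolerance below are arbitrary:

```python
import numpy as np

# Toy check of (30)-(33) for F_n(w) = 0.5 * w^T A w, which is
# beta-strongly convex and L-smooth with beta = min eig(A), L = max eig(A);
# the minimizer is w* = 0 with F_n(w*) = 0.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 0.5 * np.eye(5)                  # positive definite
eigs = np.linalg.eigvalsh(A)
beta, L = eigs[0], eigs[-1]

F = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w
w, w2 = rng.standard_normal(5), rng.standard_normal(5)
tol = 1e-9

assert 2 * L * F(w) >= grad(w) @ grad(w) - tol                    # (30)
assert (grad(w) - grad(w2)) @ (w - w2) >= \
       np.sum((grad(w) - grad(w2)) ** 2) / L - tol                # (31)
assert 2 * beta * F(w) <= grad(w) @ grad(w) + tol                 # (32)
assert beta * np.linalg.norm(w) <= np.linalg.norm(grad(w)) + tol  # (33)
print("(30)-(33) hold on this instance")
```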
B. Proof of Lemma 1

Since $J_n^t$ is $L$-smooth and $\beta$-strongly convex, from (30) and (32), respectively, we have
$$J_n^t(z_k) - J_n^t(z^*) \ge \frac{\|\nabla J_n^t(z_k)\|^2}{2L}, \qquad J_n^t(z) - J_n^t(z^*) \le \frac{\|\nabla J_n^t(z)\|^2}{2\beta}.$$
Combining these inequalities with (8), and setting $z = w^{t-1}$ and $z_k = w_n^t$, we have
$$\|\nabla J_n^t(w_n^t)\|^2 \le c \, \frac{L}{\beta} \, (1 - \gamma)^k \, \|\nabla J_n^t(w^{t-1})\|^2.$$
Since $(1 - \gamma)^k \le e^{-k\gamma}$, the $\theta$-approximation condition (3) is satisfied when $c \frac{L}{\beta} e^{-k\gamma} \le \theta^2$. Taking the logarithm of both sides of this inequality, we complete the proof.
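Lemma 1 effectively prescribes the number of local rounds: solving $c \frac{L}{\beta} e^{-k\gamma} \le \theta^2$ for $k$ gives $k \ge \frac{1}{\gamma} \log \frac{cL}{\beta\theta^2}$. A minimal sketch of this calculation (assuming the reconstructed $\theta^2$ threshold above; the constants are purely illustrative):

```python
import math

def local_rounds(c, L, beta, gamma, theta):
    """Smallest k with c*(L/beta)*exp(-k*gamma) <= theta**2,
    i.e., k >= (1/gamma)*log(c*L/(beta*theta**2))."""
    return max(0, math.ceil(math.log(c * L / (beta * theta ** 2)) / gamma))

# Illustrative constants: condition number L/beta = 10, c = 1, gamma = 0.1.
print(local_rounds(c=1.0, L=10.0, beta=1.0, gamma=0.1, theta=0.5))  # -> 37
```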
C. Proof of Theorem 1
We recall the definition of $J_n^t(w)$:
$$J_n^t(w) = F_n(w) + \langle \eta \nabla \bar{F}^{t-1} - \nabla F_n(w^{t-1}), w \rangle. \quad (34)$$
Denoting by $\hat{w}_n^t$ the solution to $\min_{w \in \mathbb{R}^d} J_n^t(w)$, we have
$$\nabla J_n^t(w^{t-1}) = \eta \nabla \bar{F}^{t-1}, \quad (35)$$
$$\nabla J_n^t(\hat{w}_n^t) = 0 = \nabla F_n(\hat{w}_n^t) + \eta \nabla \bar{F}^{t-1} - \nabla F_n(w^{t-1}). \quad (36)$$
Since $F(\cdot)$ is also $L$-smooth (i.e., $\|\nabla F(w) - \nabla F(w')\| \le \sum_{n=1}^{N} \frac{D_n}{D} \|\nabla F_n(w) - \nabla F_n(w')\| \le L \|w - w'\|$, $\forall w, w'$, by using Jensen's inequality and $L$-smoothness, respectively), we have
$$F(w_n^t) - F(w^{t-1}) \le \langle \nabla F(w^{t-1}), w_n^t - w^{t-1} \rangle + \frac{L}{2} \|w_n^t - w^{t-1}\|^2$$
$$= \langle \nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}, w_n^t - w^{t-1} \rangle + \frac{L}{2} \|w_n^t - w^{t-1}\|^2 + \langle \nabla \bar{F}^{t-1}, w_n^t - w^{t-1} \rangle \quad (37)$$
$$\le \|\nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}\| \|w_n^t - w^{t-1}\| + \frac{L}{2} \|w_n^t - w^{t-1}\|^2 + \langle \nabla \bar{F}^{t-1}, w_n^t - w^{t-1} \rangle \quad (38)$$
$$\overset{(36)}{=} \|\nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}\| \|w_n^t - w^{t-1}\| + \frac{L}{2} \|w_n^t - w^{t-1}\|^2 - \frac{1}{\eta} \langle \nabla F_n(\hat{w}_n^t) - \nabla F_n(w^{t-1}), w_n^t - w^{t-1} \rangle$$
$$= \|\nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}\| \|w_n^t - w^{t-1}\| + \frac{L}{2} \|w_n^t - w^{t-1}\|^2 - \frac{1}{\eta} \langle \nabla F_n(\hat{w}_n^t) - \nabla F_n(w_n^t), w_n^t - w^{t-1} \rangle - \frac{1}{\eta} \langle \nabla F_n(w_n^t) - \nabla F_n(w^{t-1}), w_n^t - w^{t-1} \rangle \quad (39)$$
$$\le \|\nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}\| \|w_n^t - w^{t-1}\| + \frac{L}{2} \|w_n^t - w^{t-1}\|^2 + \frac{L}{\eta} \|\hat{w}_n^t - w_n^t\| \|w_n^t - w^{t-1}\| - \frac{1}{\eta} \langle \nabla F_n(w_n^t) - \nabla F_n(w^{t-1}), w_n^t - w^{t-1} \rangle \quad (40)$$
$$\overset{(31)}{\le} \|\nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}\| \|w_n^t - w^{t-1}\| + \frac{L}{2} \|w_n^t - w^{t-1}\|^2 + \frac{L}{\eta} \|\hat{w}_n^t - w_n^t\| \|w_n^t - w^{t-1}\| - \frac{1}{\eta L} \|\nabla F_n(w_n^t) - \nabla F_n(w^{t-1})\|^2, \quad (41)$$
where (37) is by adding and subtracting $\nabla \bar{F}^{t-1}$, (38) is by the Cauchy-Schwarz inequality, (39) is by adding and subtracting $\nabla F_n(w_n^t)$, and (40) is by using the Cauchy-Schwarz inequality and the $L$-smoothness of $F_n(\cdot)$. The next step is to bound the norm terms in the R.H.S. of (41) as follows:
• First, we have
$$\|\hat{w}_n^t - w^{t-1}\| \overset{(33)}{\le} \frac{1}{\beta} \|\nabla J_n^t(w^{t-1})\| \overset{(35)}{=} \frac{\eta}{\beta} \|\nabla \bar{F}^{t-1}\|. \quad (42)$$
• Next,
$$\|\hat{w}_n^t - w_n^t\| \overset{(33)}{\le} \frac{1}{\beta} \|\nabla J_n^t(w_n^t)\| \overset{(3)}{\le} \frac{\theta}{\beta} \|\nabla J_n^t(w^{t-1})\| \overset{(35)}{=} \frac{\theta \eta}{\beta} \|\nabla \bar{F}^{t-1}\|. \quad (43)$$
• Using the triangle inequality, (42), and (43), we have
$$\|w_n^t - w^{t-1}\| \le \|w_n^t - \hat{w}_n^t\| + \|\hat{w}_n^t - w^{t-1}\| \le (1 + \theta) \frac{\eta}{\beta} \|\nabla \bar{F}^{t-1}\|. \quad (44)$$
• We also have
$$\|\nabla F_n(w_n^t) - \nabla F_n(w^{t-1})\| \overset{(34)}{=} \|\nabla J_n^t(w_n^t) - \nabla J_n^t(w^{t-1})\| \ge \|\nabla J_n^t(w^{t-1})\| - \|\nabla J_n^t(w_n^t)\| \overset{(3)}{\ge} (1 - \theta) \|\nabla J_n^t(w^{t-1})\| \overset{(35)}{=} (1 - \theta) \eta \|\nabla \bar{F}^{t-1}\|. \quad (45)$$
• By the definitions of $\nabla F(\cdot)$ and $\nabla \bar{F}^{t-1}$, we have
$$\|\nabla F(w^{t-1}) - \nabla \bar{F}^{t-1}\| = \Big\| \sum_{n=1}^{N} p_n \big( \nabla F_n(w_n^t) - \nabla F_n(w^{t-1}) \big) \Big\| \le \sum_{n=1}^{N} p_n \|\nabla F_n(w_n^t) - \nabla F_n(w^{t-1})\| \quad (46)$$
$$\le \sum_{n=1}^{N} p_n L \|w_n^t - w^{t-1}\| \quad (47)$$
$$\le (1 + \theta) \eta \frac{L}{\beta} \|\nabla \bar{F}^{t-1}\|, \quad (48)$$
where (46), (47), and (48) are obtained using Jensen's inequality, $L$-smoothness, and (44), respectively.
• Finally, we have
$$\|\nabla \bar{F}^{t-1}\|^2 \le 2 \|\nabla \bar{F}^{t-1} - \nabla F(w^{t-1})\|^2 + 2 \|\nabla F(w^{t-1})\|^2 \quad (49)$$
$$\overset{(48)}{\le} 2 (1 + \theta)^2 \eta^2 \rho^2 \|\nabla \bar{F}^{t-1}\|^2 + 2 \|\nabla F(w^{t-1})\|^2,$$
which implies
$$\|\nabla \bar{F}^{t-1}\|^2 \le \frac{2}{1 - 2 (1 + \theta)^2 \eta^2 \rho^2} \|\nabla F(w^{t-1})\|^2, \quad (50)$$
where (49) comes from the fact that $\|x + y\|^2 \le 2\|x\|^2 + 2\|y\|^2$ for any two vectors $x$ and $y$.
By substituting (42), (44), (45), and (48) into (41), we have
$$F(w_n^t) - F(w^{t-1}) \le \frac{Z}{\beta} \|\nabla \bar{F}^{t-1}\|^2 \overset{(50)}{\le} \frac{Z}{\beta} \cdot \frac{2}{1 - 2 (1 + \theta)^2 \eta^2 \rho^2} \|\nabla F(w^{t-1})\|^2$$
$$\overset{(32)}{\le} \frac{4Z}{1 - 2 (1 + \theta)^2 \eta^2 \rho^2} \big( F(w^{t-1}) - F(w^*) \big) \quad (51)$$
$$\overset{(11)}{=} -\Theta \big( F(w^{t-1}) - F(w^*) \big), \quad (52)$$
with $Z := \frac{\eta \big( -(\theta - 1)^2 + (\theta + 1)\theta(3\eta + 2)\rho^2 + (\theta + 1)^2 \eta \rho^2 \big)}{2\rho^2} < 0$ and $1 - 2(1 + \theta)^2 \eta^2 \rho^2 > 0$ to ensure $\Theta > 0$.
By subtracting $F(w^*)$ from both sides of (52), we have
$$F(w_n^t) - F(w^*) \le (1 - \Theta) \big( F(w^{t-1}) - F(w^*) \big), \quad \forall n. \quad (53)$$
Finally, we obtain
$$F(w^t) - F(w^*) \le \sum_{n=1}^{N} p_n \big( F(w_n^t) - F(w^*) \big) \quad (54)$$
$$\overset{(53)}{\le} (1 - \Theta) \big( F(w^{t-1}) - F(w^*) \big), \quad (55)$$
where (54) is due to the convexity of $F(\cdot)$.
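The contraction (55) implies $F(w^t) - F(w^*) \le (1 - \Theta)^t \big( F(w^0) - F(w^*) \big)$, so the number of global rounds needed for an $\epsilon$-accurate solution can be read off directly. A minimal sketch of this calculation (the value of $\Theta$ and the initial optimality gap are illustrative placeholders, not computed from (11)):

```python
import math

def global_rounds(theta_rate, init_gap, eps):
    """Smallest t with (1 - theta_rate)**t * init_gap <= eps,
    per the linear convergence in (55); requires 0 < theta_rate < 1."""
    assert 0.0 < theta_rate < 1.0
    return math.ceil(math.log(init_gap / eps) / -math.log(1.0 - theta_rate))

# E.g., Theta = 0.05 and an initial gap of 10 reach eps = 1e-3
# within 180 communication rounds.
print(global_rounds(theta_rate=0.05, init_gap=10.0, eps=1e-3))  # -> 180
```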
D. Proof of Lemma 2

The convexity of SUB1 follows from its strictly convex objective in (13) and from its constraints determining a convex set. Thus, the global optimal solution of SUB1 can be found using the KKT conditions [28]. In the following, we first provide the KKT conditions of SUB1, and then show that the solution in Lemma 2 satisfies these conditions.
The Lagrangian of SUB1 is
$$\mathcal{L} = \sum_{n=1}^{N} \Big[ E_{n,cp} + \lambda_n \Big( \frac{c_n D_n}{f_n} - T_{cp} \Big) + \mu_n (f_n - f_n^{max}) - \nu_n (f_n - f_n^{min}) \Big] + \kappa T_{cp},$$
where $\lambda_n, \mu_n, \nu_n$ are non-negative dual variables with their optimal values denoted by $\lambda_n^*, \mu_n^*, \nu_n^*$, respectively. Then the KKT conditions are as follows:
$$\frac{\partial \mathcal{L}}{\partial f_n} = \frac{\partial E_{n,cp}}{\partial f_n} - \lambda_n \frac{c_n D_n}{f_n^2} + \mu_n - \nu_n = 0, \quad \forall n \quad (56)$$
$$\frac{\partial \mathcal{L}}{\partial T_{cp}} = \kappa - \sum_{n=1}^{N} \lambda_n = 0, \quad (57)$$
$$\mu_n (f_n - f_n^{max}) = 0, \quad \forall n \quad (58)$$
$$\nu_n (f_n - f_n^{min}) = 0, \quad \forall n \quad (59)$$
$$\lambda_n \Big( \frac{c_n D_n}{f_n} - T_{cp} \Big) = 0, \quad \forall n. \quad (60)$$
Next, we show that the optimal solution according to the KKT conditions is the same as that provided by Lemma 2. To do that, we observe that the existence of $\mathcal{N}_1$, $\mathcal{N}_2$, $\mathcal{N}_3$ and their respective $T_{\mathcal{N}_1}$, $T_{\mathcal{N}_2}$, $T_{\mathcal{N}_3}$ produced by Algorithm 2 depends on $\kappa$. Therefore, we construct the ranges of $\kappa$ such that there exist three subsets $\mathcal{N}_1'$, $\mathcal{N}_2'$, $\mathcal{N}_3'$ of UEs satisfying the KKT conditions and having the same solution as that in Lemma 2, in the following cases.
a) $T_{cp}^* = T_{\mathcal{N}_1} \ge \max\{T_{\mathcal{N}_2}, T_{\mathcal{N}_3}\}$: This happens when $\kappa$ is large enough so that the condition in line 4 of Algorithm 2 is satisfied, because $T_{\mathcal{N}_1}$ decreases as $\kappa$ increases. Thus we consider $\kappa \ge \sum_{n=1}^{N} \alpha_n (f_n^{max})^3$ (which ensures that $\mathcal{N}_1$ of Algorithm 2 is non-empty).
From (57), we have
$$\kappa = \sum_{n=1}^{N} \lambda_n^*, \quad (61)$$
thus $\kappa$ in this range can guarantee a non-empty set $\mathcal{N}_1' = \{ n \mid \lambda_n^* \ge \alpha_n (f_n^{max})^3 \}$ such that
$$\frac{\partial E_{n,cp}(f_n^*)}{\partial f_n} - \lambda_n^* \frac{c_n D_n}{(f_n^*)^2} \le 0, \quad \forall n \in \mathcal{N}_1' : f_n^* \le f_n^{max}.$$
Then from (56) we must have $\mu_n^* - \nu_n^* \ge 0$; thus, according to (58), $f_n^* = f_n^{max}$, $\forall n \in \mathcal{N}_1'$. From (60), we see that $\mathcal{N}_1' = \{ n : \frac{c_n D_n}{f_n^{max}} = T_{cp}^* \}$. Hence, by the definition in (19),
$$T_{cp}^* = \max_{n \in \mathcal{N}} \frac{c_n D_n}{f_n^{max}}. \quad (62)$$
On the other hand, if there exists a non-empty set $\mathcal{N}_2' = \{ n \mid \lambda_n^* = 0 \}$, it must be due to $\frac{c_n D_n}{f_n^{min}} \le T_{cp}^*$, $\forall n \in \mathcal{N}_2'$, according to (60). In this case, from (56) we must have $\mu_n^* - \nu_n^* \le 0 \Rightarrow f_n^* = f_n^{min}$, $\forall n \in \mathcal{N}_2'$.
Finally, if there exist UEs with $\frac{c_n D_n}{f_n^{min}} > T_{cp}^*$ and $\frac{c_n D_n}{f_n^{max}} < T_{cp}^*$, then from (60) these UEs must satisfy $\frac{c_n D_n}{f_n^*} = T_{cp}^*$, i.e., $f_n^* = \frac{c_n D_n}{T_{cp}^*} \in (f_n^{min}, f_n^{max})$.
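The case analysis above has a simple operational reading: given the optimal deadline $T_{cp}^*$, each UE runs at the frequency that exactly meets the deadline, clipped into its feasible range. A minimal sketch of this structure (toy values; this mimics the solution form of Lemma 2, not Algorithm 2 itself):

```python
import numpy as np

def cpu_frequencies(T_cp, c, D, f_min, f_max):
    """Per-UE frequency meeting deadline T_cp exactly, clipped to
    [f_min, f_max]: clipped at f_max -> set N1, at f_min -> N2,
    interior -> N3 (per the three subsets in Lemma 2)."""
    return np.clip(c * D / T_cp, f_min, f_max)

c = np.array([20.0, 10.0, 30.0])   # CPU cycles per data unit (toy)
D = np.array([1e6, 2e6, 0.5e6])    # local data sizes (toy)
print(cpu_frequencies(T_cp=0.02, c=c, D=D, f_min=0.3e9, f_max=2.0e9))
```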
E. Proof of Lemma 3
According to (16) and (17), the objective of SUB2 is the sum of perspective functions of convex and linear functions, and its constraints determine a convex set; thus SUB2 is a convex problem that can be analyzed using the KKT conditions [28].
The Lagrangian of SUB2 is
$$\mathcal{L} = \sum_{n=1}^{N} E_{n,co}(\tau_n) + \lambda \Big( \sum_{n=1}^{N} \tau_n - T_{co} \Big) + \sum_{n=1}^{N} \mu_n (\tau_n - \tau_n^{max}) - \sum_{n=1}^{N} \nu_n (\tau_n - \tau_n^{min}) + \kappa T_{co},$$
where $\lambda, \mu_n, \nu_n$ are non-negative dual variables. Then the KKT conditions are as follows:
$$\frac{\partial \mathcal{L}}{\partial \tau_n} = \frac{\partial E_{n,co}}{\partial \tau_n} + \lambda + \mu_n - \nu_n = 0, \quad \forall n \quad (66)$$
$$\frac{\partial \mathcal{L}}{\partial T_{co}} = \kappa - \lambda = 0, \quad (67)$$
$$\mu_n (\tau_n - \tau_n^{max}) = 0, \quad \forall n \quad (68)$$
$$\nu_n (\tau_n - \tau_n^{min}) = 0, \quad \forall n \quad (69)$$
$$\lambda \Big( \sum_{n=1}^{N} \tau_n - T_{co} \Big) = 0. \quad (70)$$
From (67), we see that $\lambda^* = \kappa$. Letting $x := \frac{s_n}{\tau_n B}$, we first consider the equation
$$\frac{\partial E_{n,co}}{\partial \tau_n} + \lambda^* = 0 \;\Leftrightarrow\; \frac{N_0}{h_n} \big( e^x - 1 - x e^x \big) = -\lambda^* = -\kappa \;\Leftrightarrow\; e^x (x - 1) = \frac{\kappa h_n}{N_0} - 1$$
$$\Leftrightarrow\; e^{x-1} (x - 1) = \frac{\kappa h_n / N_0 - 1}{e} \;\Leftrightarrow\; x = 1 + W\Big( \frac{\kappa h_n / N_0 - 1}{e} \Big) \;\Leftrightarrow\; \tau_n = g_n(\kappa) = \frac{s_n / B}{1 + W\big( \frac{\kappa h_n / N_0 - 1}{e} \big)}.$$
Because $W(\cdot)$ is strictly increasing when $W(\cdot) > -\ln 2$, $g_n(\kappa)$ is strictly decreasing and positive, and so is its inverse function $g_n^{-1}(\tau_n) = -\frac{\partial E_{n,co}(\tau_n)}{\partial \tau_n}$. Then we have:
a) If $g_n(\kappa) \le \tau_n^{min} \Leftrightarrow \kappa \ge g_n^{-1}(\tau_n^{min})$, then we have
$$\kappa = \lambda^* \ge g_n^{-1}(\tau_n^{min}) \ge -\frac{\partial E_{n,co}}{\partial \tau_n} \Big|_{\tau_n \ge \tau_n^{min}}.$$
Thus, according to (66), $\mu_n^* - \nu_n^* \le 0$. Because both $\mu_n^*$ and $\nu_n^*$ cannot be positive, we have $\mu_n^* = 0$ and $\nu_n^* \ge 0$. Then we consider two cases of $\nu_n^*$: (i) if $\nu_n^* > 0$, from (69), $\tau_n^* = \tau_n^{min}$; and (ii) if $\nu_n^* = 0$, from (66), we must have $\kappa = g_n^{-1}(\tau_n^{min})$, and thus $\tau_n^* = \tau_n^{min}$.
b) If $g_n(\kappa) \ge \tau_n^{max} \Leftrightarrow \kappa \le g_n^{-1}(\tau_n^{max})$, then we have
$$\kappa = \lambda^* \le g_n^{-1}(\tau_n^{max}) \le -\frac{\partial E_{n,co}}{\partial \tau_n} \Big|_{\tau_n \le \tau_n^{max}}.$$
Thus, according to (66), $\mu_n^* - \nu_n^* \ge 0$, inducing $\nu_n^* = 0$ and $\mu_n^* \ge 0$. With similar reasoning as above, we have $\tau_n^* = \tau_n^{max}$.
c) If $\tau_n^{min} < g_n(\kappa) < \tau_n^{max} \Leftrightarrow g_n^{-1}(\tau_n^{max}) < \kappa < g_n^{-1}(\tau_n^{min})$, then from (66) we must have $\mu_n^* = \nu_n^* = 0$, and thus $\tau_n^* = g_n(\kappa)$.
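A minimal numerical sketch of $g_n(\kappa)$ using SciPy's Lambert $W$, with the clipping from cases a)-c); the reconstructed closed form above and all numbers here are illustrative:

```python
import numpy as np
from scipy.special import lambertw

def g_n(kappa, s_n, B, h_n, N0, tau_min, tau_max):
    """tau_n = (s_n/B) / (1 + W((kappa*h_n/N0 - 1)/e)), clipped to
    [tau_min, tau_max] per cases a)-c) of the proof of Lemma 3."""
    x = 1.0 + np.real(lambertw((kappa * h_n / N0 - 1.0) / np.e))
    return float(np.clip((s_n / B) / x, tau_min, tau_max))

# Toy numbers: 0.2 Mb update, 1 MHz bandwidth, normalized channel/noise.
print(g_n(kappa=2.0, s_n=2e5, B=1e6, h_n=1.0, N0=1.0,
          tau_min=0.05, tau_max=1.0))   # ~0.156 s
```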