Learning Rate Optimization for Federated Learning Exploiting Over-the-air Computation
Chunmei Xu, Student Member, IEEE, Shengheng Liu, Member, IEEE, Zhaohui Yang, Member, IEEE, Yongming Huang, Senior Member, IEEE, and Kai-Kit Wong, Fellow, IEEE
Abstract—Federated learning (FL), a promising edge-learning framework, can effectively address latency and privacy issues by featuring distributed learning at the devices and model aggregation at the central server. To enable efficient wireless data aggregation, over-the-air computation (AirComp) has recently been proposed and has attracted immediate attention. However, fading of wireless channels can produce aggregate distortions in an AirComp-based FL scheme. To combat this effect, the concept of dynamic learning rate (DLR) is proposed in this work. We begin our discussion with the multiple-input-single-output (MISO) scenario, since the underlying optimization problem is convex and has a closed-form solution. We then extend our studies to the more general multiple-input-multiple-output (MIMO) case, for which an iterative method is derived. Extensive simulation results demonstrate the effectiveness of the proposed scheme in reducing the aggregate distortion and guaranteeing the testing accuracy on the MNIST and CIFAR10 datasets. In addition, we present an asymptotic analysis and give a near-optimal receive beamforming design in closed form, which is verified by numerical simulations.
Index Terms—Distributed algorithm, federated learning, over-the-air computation, learning rate, beamforming.
I. INTRODUCTION
Future sixth-generation (6G) communication networks are envisioned to undergo a profound transformation, evolving from connected things to connected intelligence with more stringent requirements such as dense networking, strict security, high energy efficiency, and high intelligence [1], [2]. Artificial intelligence (AI) technologies, which allow automatic analysis of the large mass of data generated in wireless networks and subsequent optimization of highly dynamic and complex networks [3]–[5], will shape the landscape of 6G. Conversely, 6G will give renewed impetus to AI-empowered mobile applications by supplying advanced wireless communications and mobile computing technologies [6] as supporting infrastructure. AI tasks entail increasingly intensive computation workloads. Hence, they are generally migrated to and trained at a server center with sufficient computation resources, using data that is first collected from the devices/sensors and then uploaded to the center [7], [8]. The data volume can be considerably large, thus imposing a heavy transmission traffic burden and increasing the latency. Another critical problem is the serious concern of privacy leakage, since the data generated at the devices, e.g., photos and social-networking records, are often privacy sensitive.

C. Xu, S. Liu, and Y. Huang are with the School of Information Science and Engineering, Southeast University, Nanjing 210096, China, and also with the Purple Mountain Laboratories, Nanjing 211111, China (e-mail: {xuchunmei; s.liu; huangym}@seu.edu.cn). Z. Yang is with the Centre for Telecommunications Research, Department of Engineering, King's College London, WC2R 2LS, UK (e-mail: [email protected]). K.-K. Wong is with the Department of Electronic and Electrical Engineering, University College London, London WC1E 6BT, United Kingdom (e-mail: [email protected]).
An intuitive way to counteract the above issues is to conduct the training and inference processes directly at the network edge, such as at devices and sensors, using locally generated real-time data. This edge-learning paradigm has the unique advantages of balanced resource support and proximity to data sources compared with cloud learning, and of higher learning accuracy compared with on-device learning, by harnessing the available computation and storage capacities [9], [10]. Federated learning (FL) tackles the aforementioned concerns by collaboratively training a shared global model with locally stored data [11]–[14]. A typical FL algorithm alternates between two iterative phases: (I) the devices receive the global model from the edge server and train local models with locally stored data; (II) these local models are transmitted to and aggregated at the edge server to yield the global model. Note that the data volume of the local models (which may consist of millions of parameters) is much smaller than that of the raw data. Nonetheless, uploading the local models of a legion of participating devices via wireless links is resource-demanding, which is the main bottleneck to implementing FL in practice. In this regard, developing communication-efficient methods is of paramount importance. Some recent works have considered asynchronous mechanisms [15], quantization [16], [17], sparsification [18], [19], and aggregation frequency [20] to reduce the transmission overhead; these, however, ignore the physical and network layers. In the second phase of FL, the edge server averages the local model parameters from the distributed devices, which is essentially wireless data aggregation.
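The two phases above can be sketched numerically. The following toy NumPy example (with illustrative values and a single gradient step as the local update) shows that averaging the locally updated models reproduces a centralized gradient step when all devices use the same learning rate:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, mu = 4, 8, 0.1                       # devices, model size, learning rate
w0 = rng.standard_normal(D)                # global model broadcast by the server

# Phase (I): each device trains locally (one gradient step on its own data)
local_grads = rng.standard_normal((K, D))  # stand-ins for the true local gradients
local_models = np.stack([w0 - mu * g for g in local_grads])

# Phase (II): the edge server aggregates (averages) the local models
w1 = local_models.mean(axis=0)

# Averaging K single-step models equals one centralized step on the mean gradient
assert np.allclose(w1, w0 - mu * local_grads.mean(axis=0))
```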
Conventional multiple-access schemes, such as orthogonal frequency-division multiple access (OFDMA), are based on the separated-communication-and-computation principle. In [21], a time-division multiple access (TDMA) system was considered, where a joint batch-size selection and communication resource allocation scheme was developed to accelerate the training process and improve the learning efficiency. The impact of three different scheduling policies on the FL performance was analytically studied in large-scale wireless networks [22]. Such suboptimal separated-communication-and-computation approaches can result in a sharp rise in the consumption of wireless resources as well as congestion of the air interface [9]. Very recently, the over-the-air computation (AirComp) scheme was proposed by leveraging the waveform superposition property of multiple-access channels, which is fundamentally different from the traditional separated-communication-and-computation principle [23]. By aggregating the data simultaneously received from distributed devices in an analog manner, the AirComp technique can further improve communication efficiency [24]–[26]. The AirComp technique has recently been applied to implementing FL. Specifically, an approach based on gradient sparsification and random linear projection was proposed, in which the reduced data were transmitted via AirComp to address bandwidth and power limitations; it outperformed its digital counterpart [24], [25]. As a matter of fact, the distortions caused by fading and noisy channels are critical for learning tasks, as a large aggregation error may degrade the inference accuracy. In [26], the transmission power was designed by truncated channel inversion, and two scheduling schemes were proposed to guarantee identical amplitudes of the received signals among devices in a single-input-single-output (SISO) system so as to reduce the aggregate error.
Meanwhile, in [27], a joint device selection and receive beamforming design was investigated in a single-input-multiple-output (SIMO) configuration, and a novel unified difference-of-convex (DC) formulation was proposed. On the other hand, the problem of distortion minimization in an intelligent reflecting surface (IRS)-aided cloud radio access network (CRAN) system was addressed in [28], where a joint optimization of the passive beamforming and the linear detection vector was designed. Furthermore, with the aid of multiple IRSs, a novel resource allocation and device selection method was developed to minimize the aggregate error as well as maximize the number of selected devices [29]. These existing works utilize wireless resources, such as power control, device selection, and beamforming design, as well as channel configuration, to align the received signals from the distributed devices. Nevertheless, they do not fully unleash the potential of the hyper-parameters from the perspective of machine learning (ML). The learning rate is a key hyper-parameter that determines the convergence and the convergence rate of learning tasks. A large learning rate will hinder convergence and cause the loss function to oscillate around the minimum or even diverge, while too small a learning rate will lead to slow convergence [30]. Selecting an optimal learning rate is always a critical yet tricky issue for learning algorithms to work properly. One feasible approach is to adopt learning rate schedulers, which adjust the learning rate online during training. However, a scheduler has to be designed in advance and is unable to adapt to the characteristics of the dataset [31]. Adaptive learning rates such as Adagrad can adapt to the data and change with the gradients, and are widely used in the deep learning community [32]. Later, the cyclical learning rate (CLR) was proposed to let the learning rate vary cyclically between reasonable boundary values, which incurs less computational cost and can significantly enhance the learning performance [33].
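For concreteness, the triangular CLR policy described in [33] can be sketched as follows (a minimal illustration; the boundary values and cycle length are arbitrary):

```python
import math

def triangular_clr(step, base_lr, max_lr, half_cycle):
    """Triangular cyclical learning rate: ramps linearly from base_lr up to
    max_lr and back down, repeating every 2 * half_cycle steps."""
    cycle = math.floor(1 + step / (2 * half_cycle))
    x = abs(step / half_cycle - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)

# The rate sweeps between the two boundaries over one cycle:
lr_start = triangular_clr(0, 0.1, 1.0, 10)   # base_lr at the start of a cycle
lr_mid = triangular_clr(10, 0.1, 1.0, 10)    # max_lr at mid-cycle
```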
The essence of CLR originates from the observation that increasing the learning rate allows a more rapid traversal of saddle-point plateaus and thus achieves a longer-term beneficial effect. Inspired by this study, we propose a dynamic learning rate (DLR) that varies between minimum and maximum boundaries to adapt to the fading channels, in order to further reduce the aggregate error caused by the fading and noisy channels. In this paper, we consider FL for AI-empowered mobile applications, such as e-health services, which will be supported by 6G networks. Instead of adopting the conventional separated-communication-and-computation pattern, we incorporate AirComp to aggregate local models from distributed devices so as to improve the communication efficiency. In AirComp-based schemes, minimizing the resultant aggregate distortion is of paramount importance, as a large distortion can degrade the performance of AI tasks. To mitigate the wireless distortion, measured by the mean square error (MSE), we first propose a DLR scheme that adapts to the wireless channels, and jointly consider receive beamforming optimization. The technical contributions of this work are summarized below.

• To the best of our knowledge, this is the first work to study the use of DLR for FL over wireless communications to reduce the aggregate error, which is fundamentally different from existing works that only consider the optimization of wireless resources. We analytically show that the DLR can be properly designed to mitigate the distortion caused by fading. It is also proved that the MSE can be further decreased by considering the DLR.

• For MSE minimization via AirComp, we jointly optimize the DLR ratios and the wireless resources. Both MISO and MIMO scenarios are considered, and a closed-form solution and an iterative algorithm are respectively developed. Extensive simulation results demonstrate the effectiveness of the proposed scheme in further reducing the MSE as well as improving the learning performance on the MNIST and CIFAR10 datasets.
• In addition, we present the asymptotic beamforming solution in closed form when the number of transmit/receive antennas tends to infinity. Simulation results verify the theoretical analysis as well as the receive beamforming design.

The outline of this paper is organized as follows. Some necessary mathematical descriptions of FL and AirComp are presented in Section II. The concept of DLR is introduced in Section III. In Sections IV and V, the DLR optimization problems in the MISO and MIMO scenarios are respectively formulated and solved. Next, we present the theoretical asymptotic analysis and, on this basis, propose a near-optimal closed-form receive beamforming solution in Section VI. Then, in Section VII, numerical simulations are provided to showcase the advantages of the proposed scheme. Finally, the paper is concluded in Section VIII.

II. PRELIMINARY
In this work, we consider the problem of FL over wireless networks. The configuration of the system under investigation is depicted in Fig. 1. The wireless network consists of $K$ devices with $N_d$ antennas each and an aggregator with $N_t$ antennas. The set of devices is denoted as $\mathcal{K}$. Each device $k$ updates its model based on locally distributed data $\mathcal{D}_k$, which cannot be shared with other entities out of latency and privacy concerns.

Fig. 1. An FL system over wireless communication: the devices perform local model updates and the aggregator performs model aggregation.
A. FL
FL has recently emerged as an effective distributed approach that enables wireless devices to collaboratively build a shared learning model with training taking place locally. The objective of FL is to minimize the aggregate loss:
$$\mathbf{w}^o \triangleq \arg\min_{\mathbf{w}} \frac{1}{K}\sum_{k=1}^{K} P_k(\mathbf{w}), \qquad (1)$$
where $\mathbf{w} \in \mathbb{R}^D$ is a vector containing the model parameters, $D$ is the dimension of the FL model, and $P_k(\mathbf{w})$ is the local loss at device $k$ based on $\mathcal{D}_k$, given by
$$P_k(\mathbf{w}) \triangleq \frac{1}{|\mathcal{D}_k|}\sum_{n=1}^{|\mathcal{D}_k|} Q_k(\mathbf{w}; \mathbf{x}_n, y_n), \qquad (2)$$
where $Q_k$ is the loss function on the sample $(\mathbf{x}_n, y_n)$, with $\mathbf{x}_n$ the input and $y_n$ the label. To obtain the solution of (1), a centralized gradient descent method is applied and the parameters are updated as
$$\mathbf{w}^{i+1} = \mathbf{w}^i - \mu\Big(\frac{1}{K}\sum_{k=1}^{K} \mathbf{g}_k(\mathbf{w}^i)\Big), \qquad (3)$$
where $i$ is the iteration index, $\mu$ is the learning rate, and $\mathbf{g}_k(\mathbf{w}^i) = \nabla_{\mathbf{w}} P_k(\mathbf{w}^i)$ is the gradient of the loss with respect to $\mathbf{w}^i$. Hereinafter, we denote $\mathbf{g}_k(\mathbf{w}^i)$ by $\mathbf{g}_k$ for notational simplicity. Since the aggregator cannot access the data distributed at any particular device $k$, the gradient term $\mathbf{g}_k$ is calculated locally and the local model, denoted as $\mathbf{w}_k$, is updated accordingly. Introducing the local learning rate $\mu_k$, the local model at device $k$ is updated by
$$\mathbf{w}_k^{i+1} = \mathbf{w}^i - \mu_k \mathbf{g}_k. \qquad (4)$$
With the local models received at the aggregator, the update of the global model (3) is rewritten as
$$\mathbf{w}^{i+1} = \frac{1}{K}\sum_{k=1}^{K} \mathbf{w}_k^{i+1}. \qquad (5)$$

B. AirComp
As introduced earlier, AirComp integrates computation and communication by exploiting the waveform superposition property, which harnesses interference to help functional computation [23]. AirComp comprises three stages: (i) pre-processing at the transmitters; (ii) superposition over the air; and (iii) post-processing at the receiver [34]. In this work, the pre-processing is assumed to be the identity mapping. Each parameter in a local model is modulated as a symbol, and the symbol vector $\mathbf{s}_k^{i+1} \triangleq \mathbf{w}_k^{i+1} \in \mathbb{C}^D$ is obtained accordingly. The symbol vector is assumed to be normalized to unit variance, i.e., $\mathbb{E}\big[\mathbf{s}_k^{i+1}(\mathbf{s}_k^{i+1})^H\big] = \mathbf{I}$. For notational simplicity, the $d$-th elements of $\mathbf{s}_k$, $\mathbf{w}^i$, and $\mathbf{g}_k$, i.e., $\mathbf{s}_k[d]$, $\mathbf{w}^i[d]$, and $\mathbf{g}_k[d]$, are written as $s_k$, $w^i$, and $g_k$. As such, the desired signal based on (3) can be represented by
$$y_{\mathrm{des}} = \frac{1}{K}\sum_{k=1}^{K} s_k = \frac{1}{K}\sum_{k=1}^{K} w_k^{i+1} = w^i - \frac{\mu}{K}\sum_{k=1}^{K} g_k. \qquad (6)$$
Considering the multiple-access channel property of wireless communication, the received signal is a linear sum of the transmitted signals plus uncertainty. Hence, after post-processing, the received signal at the receiver can be expressed as
$$y = \sqrt{\eta}\Big(\sum_{k=1}^{K} A_k s_k + B\Big), \qquad (7)$$
where $s_k$ is the input of the communication system from device $k$, and $\eta$ is a scaling factor. The variables $A_k$ and $B$ depend on the specific scenario settings. In particular, we have $A_k = h_k b_k$ and $B = n$ for SISO, and $A_k = \mathbf{h}_k^T \mathbf{b}_k$ and $B = n$ for MISO. For the SIMO and MIMO scenarios, we have $A_k = \mathbf{m}^H \mathbf{h}_k b_k$, $B = \mathbf{m}^H \mathbf{n}$, and $A_k = \mathbf{m}^H \mathbf{H}_k \mathbf{b}_k$, $B = \mathbf{m}^H \mathbf{n}$, respectively.
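As a toy SISO illustration of (7) (with illustrative values): if each device inverts its own channel so that $A_k = 1$, and $\eta$ is chosen to satisfy $\sqrt{\eta}\sum_k A_k = 1$, the noiseless received signal is exactly the desired average:

```python
import numpy as np

rng = np.random.default_rng(2)
K = 5
s = rng.standard_normal(K)                   # normalized symbols s_k
h = (rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)
b = h.conj() / np.abs(h) ** 2                # channel inversion: A_k = h_k * b_k = 1
A = h * b
eta = 1.0 / np.abs(A.sum()) ** 2             # so that sqrt(eta) * sum_k A_k = 1
B = 0.0                                      # noise suppressed for illustration
y = np.sqrt(eta) * (np.sum(A * s) + B)       # received signal per (7)

# With perfect alignment, AirComp recovers the average (1/K) * sum_k s_k
assert np.allclose(y, s.mean())
```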
Note that $h_k$ ($\mathbf{h}_k$, $\mathbf{H}_k$) is the independent Rayleigh fading channel (vector/matrix) from device $k$ to the aggregator, which is assumed to be block fading and to remain constant during the transmission of the model; $b_k$ ($\mathbf{b}_k$) is the transmit coefficient (vector) at device $k$; $\mathbf{m}$ is the receive beamforming vector at the aggregator; and $n$ ($\mathbf{n}$) is the noise (vector) with power $\sigma^2$. Instead of first recovering the individual terms $s_k$ and then averaging at the aggregator, we employ AirComp to estimate the desired signal directly. The AirComp aggregate error is defined as the difference between the desired signal and its estimate, which is derived as
$$e \triangleq y_{\mathrm{des}} - y = \frac{1}{K}\sum_{k=1}^{K} s_k - \sqrt{\eta}\Big(\sum_{k=1}^{K} A_k s_k + B\Big) = \sum_{k=1}^{K}\Big(\frac{1}{K} - \sqrt{\eta} A_k\Big) s_k - \sqrt{\eta} B. \qquad (8)$$
According to (8), the aggregate distortion originates from both fading and noise, which are reflected in the fading-channel-related term $\sqrt{\eta} A_k$ and the noise-related term $\sqrt{\eta} B$, respectively. It is worth noting that mitigating the distortion is of great importance, as severe distortion can result in a biased global model and, in turn, degrade the learning performance.

III. DLR FOR CHANNEL ADAPTATION
Existing works [26]–[29] minimize the aggregate error by means of wireless resource optimization, IRS-based channel reconfiguration, and device selection. These approaches correspond to optimizing the transmit coefficient (vector) $b_k$ ($\mathbf{b}_k$), the receive beamforming vector $\mathbf{m}$, the scaling factor $\eta$, and the passive beamforming vector $\mathbf{\Theta}$, or simply selecting a subset of the devices. Though different at first glance, these approaches share a common aim, which is to align the received signals and minimize the noise-induced error. In this work, by contrast, we take a radically different perspective and propose to mitigate the distortions by optimizing the hyper-parameters of the learning process. Concretely, a strategy is designed to let the local learning rate $\mu_k$ adapt to the time-varying wireless environment. Based on the aforementioned features of FL and AirComp, we arrive at the following theorem.

Theorem 1. Denote the DLR ratio as $r_k = \mu_k/\mu$ for device $k$. To mitigate the distortion caused by the fading-channel-related term $\sqrt{\eta} A_k$, we must have
$$\sqrt{\eta}\sum_{k=1}^{K} A_k = 1, \qquad r_k = \frac{1}{K\sqrt{\eta} A_k}. \qquad (9)$$

Proof.
The aggregate error $e_{\mathrm{ch}}$ caused by the fading channels can be written as
$$e_{\mathrm{ch}} = y_{\mathrm{des}} - \sqrt{\eta}\sum_{k=1}^{K} A_k s_k = \Big(w^i - \frac{\mu}{K}\sum_{k=1}^{K} g_k\Big) - \sqrt{\eta}\sum_{k=1}^{K} A_k\big(w^i - \mu_k g_k\big) = \Big(1 - \sqrt{\eta}\sum_{k=1}^{K} A_k\Big) w^i + \sum_{k=1}^{K}\Big(\sqrt{\eta} A_k \mu_k - \frac{\mu}{K}\Big) g_k, \qquad (10)$$
which is eliminated if and only if both terms $\big(\frac{\mu}{K} - \sqrt{\eta} A_k \mu_k\big)$ and $\big(1 - \sqrt{\eta}\sum_{k=1}^{K} A_k\big)$ are $0$. Thus, we directly obtain (9).

Based on Theorem 1, the residual aggregate error is the noise-related term $\sqrt{\eta} B$, which can be measured by
$$\mathrm{MSE} = \eta\,\mathbb{E}\big(\|B\|^2\big). \qquad (11)$$
For simplicity, we consider a retransmission mechanism such that, if there exists an aggregate error, the probability of retransmission is
$$P = 1 - \exp\Big(-\frac{a\|e\|^2}{p_{\mathrm{des}}}\Big), \qquad (12)$$
where $a$ is a modulation-related parameter [35], $p_{\mathrm{des}}$ denotes the power of the desired signal, and $e$ is the aggregate error. Intuitively, a larger aggregate error leads to a higher retransmission rate.

IV. PROBLEM FORMULATION
The objective is to minimize the MSE metric given in (11), subject to the equality constraints (9) and the boundary constraint $r_k = \mu_k/\mu \in [r_{\min}, r_{\max}]$. In this section, we establish the problem formulations for both the MISO and MIMO cases, which, as will be shown in the sequel, are respectively convex and nonconvex. It is important to note that SISO and SIMO can be regarded as special cases of the MISO and MIMO scenarios, respectively.

A. MISO
In the MISO scenario, the devices, each equipped with $N_d$ antennas, transmit their models to the single-antenna aggregator. Under this scenario, we have $A_k = \mathbf{h}_k^T \mathbf{b}_k$ and $B = n$. The aggregate error measured by the MSE is then given by
$$\mathrm{MSE} = \eta\,\mathbb{E}\big(\|B\|^2\big) = \eta\,\mathbb{E}\big(\|n\|^2\big) = \eta\sigma^2. \qquad (13)$$
Since the noise power $\sigma^2$ is independent of the optimization variables, the problem can be formulated as
$$\min_{\eta, \mathbf{b}_k, r_k} \ \eta$$
$$\mathrm{s.t.} \quad \sqrt{\eta}\sum_{k=1}^{K} \mathbf{h}_k^T \mathbf{b}_k = 1, \qquad (14a)$$
$$r_k = \frac{1}{K\sqrt{\eta}\,\mathbf{h}_k^T \mathbf{b}_k}, \ \forall k, \qquad (14b)$$
$$r_k \in [r_{\min}, r_{\max}], \ \forall k, \qquad (14c)$$
$$\|\mathbf{b}_k\|^2 \le P_k, \ \forall k, \qquad (14d)$$
where $P_k$ is the maximum transmit power at device $k$, and $\mathbf{h}_k \in \mathbb{C}^{N_d}$ is the channel vector between device $k$ and the aggregator. The equality constraints (14a) and (14b) together guarantee the elimination of the error $e_{\mathrm{ch}}$ caused by the wireless fading channels as per Theorem 1, and $\mathbf{b}_k$ is designed as
$$\mathbf{b}_k = \frac{\mathbf{h}_k^H}{K\sqrt{\eta}\,\|\mathbf{h}_k\|^2 r_k}. \qquad (15)$$
The power constraint (14d) further suggests that $\frac{1}{K^2 P_k r_k^2 \|\mathbf{h}_k\|^2} \le \eta$, and thus we have
$$\eta = \max_k \frac{1}{K^2 P_k r_k^2 \|\mathbf{h}_k\|^2}. \qquad (16)$$
We learn from (14b) that $\mathbf{h}_k^T \mathbf{b}_k = \frac{1}{K\sqrt{\eta}\, r_k}$. Substituting this back into (14a), we have $\sum_{k=1}^{K} \frac{1}{K r_k} = 1$. As a result, problem (14) is equivalent to
$$\min_{r_k} \ \max_k \ \frac{1}{K^2 P_k r_k^2 \|\mathbf{h}_k\|^2} \qquad (17a)$$
$$\mathrm{s.t.} \quad \sum_{k=1}^{K} \frac{1}{K r_k} = 1, \qquad (17b)$$
$$r_k \in [r_{\min}, r_{\max}], \ \forall k. \qquad (17c)$$

Remark 1. As a special case of the MISO scenario where $A_k = h_k b_k$ and $B = n$, the problem formulated under the SISO case is similar to (17). The only difference lies in the channel and transmit coefficients, which are both complex scalars instead of vectors as in the MISO case.

B. MIMO
In the MIMO scenario, each device and the aggregator are equipped with $N_d$ and $N_t$ antennas, respectively. The terms $A_k$ and $B$ in (7) are $A_k = \mathbf{m}^H \mathbf{H}_k \mathbf{b}_k$ and $B = \mathbf{m}^H \mathbf{n}$, with $\mathbf{m}$ the receive beamforming vector. Accordingly, the aggregate error measured by the MSE is expressed as
$$\mathrm{MSE} = \eta\,\mathbb{E}\big(\|B\|^2\big) = \eta\,\mathbb{E}\big(\|\mathbf{m}^H \mathbf{n}\|^2\big) = \sigma^2 \|\mathbf{m}\|^2 \eta, \qquad (18)$$
where $\mathbf{n}$ is the independent Gaussian noise vector. Based on Theorem 1, the MSE minimization problem can be formulated as
$$\min_{\mathbf{m}, \eta, r_k, \mathbf{b}_k} \ \|\mathbf{m}\|^2 \eta \qquad (19a)$$
$$\mathrm{s.t.} \quad \sqrt{\eta}\sum_{k=1}^{K} \mathbf{m}^H \mathbf{H}_k \mathbf{b}_k = 1, \qquad (19b)$$
$$r_k = \frac{1}{K\sqrt{\eta}\,\mathbf{m}^H \mathbf{H}_k \mathbf{b}_k}, \ \forall k, \qquad (19c)$$
$$r_k \in [r_{\min}, r_{\max}], \ \forall k, \qquad (19d)$$
$$\|\mathbf{b}_k\|^2 \le P_k, \ \forall k, \qquad (19e)$$
where $\mathbf{b}_k \in \mathbb{C}^{N_d}$, $\mathbf{H}_k \in \mathbb{C}^{N_t \times N_d}$, and $\mathbf{m} \in \mathbb{C}^{N_t}$ are the transmit beamforming vector at device $k$, the channel matrix between the aggregator and device $k$, and the receive beamforming vector at the aggregator, respectively. According to constraint (19c), the optimal transmit coefficient can be readily obtained [36], i.e.,
$$\mathbf{b}_k = \frac{\mathbf{H}_k^H \mathbf{m}}{K\sqrt{\eta}\, r_k \|\mathbf{m}^H \mathbf{H}_k\|^2}. \qquad (20)$$
The power constraint (19e) further indicates that $\eta \ge \frac{1}{K^2 P_k r_k^2 \|\mathbf{m}^H \mathbf{H}_k\|^2}$ for each device $k$, which implies
$$\eta = \max_k \frac{1}{K^2 P_k r_k^2 \|\mathbf{m}^H \mathbf{H}_k\|^2}. \qquad (21)$$
Similar to the MISO scenario, the ratios $r_k$ satisfy $\sum_{k=1}^{K} \frac{1}{K r_k} = 1$, which can be easily derived from the equality constraints (19b) and (19c). Problem (19) can then be rewritten as
$$\min_{\mathbf{m}, r_k} \ \max_k \ \frac{\|\mathbf{m}\|^2}{K^2 P_k r_k^2 \|\mathbf{m}^H \mathbf{H}_k\|^2} \qquad (22a)$$
$$\mathrm{s.t.} \quad \sum_{k=1}^{K} \frac{1}{K r_k} = 1, \qquad (22b)$$
$$r_k \in [r_{\min}, r_{\max}], \ \forall k. \qquad (22c)$$

Proposition 1.
Problem (22) is equivalent to
$$\min_{\mathbf{m}, r_k} \ \max_k \ \frac{\|\mathbf{m}\|^2}{K^2 P_k r_k^2 \|\mathbf{m}^H \mathbf{H}_k\|^2} \qquad (23a)$$
$$\mathrm{s.t.} \quad (22b), (22c), \quad \|\mathbf{m}\| = 1. \qquad (23b)$$

Proof. Any $\mathbf{m}$ can be written as the product of its norm and a unit direction vector, i.e., $\mathbf{m} = \|\mathbf{m}\| \frac{\mathbf{m}}{\|\mathbf{m}\|}$. If we let $\tilde{\mathbf{m}} = \frac{\mathbf{m}}{\|\mathbf{m}\|}$, the objective function of problem (22) is equivalent to $\max_k \frac{\|\tilde{\mathbf{m}}\|^2}{K^2 P_k r_k^2 \|\tilde{\mathbf{m}}^H \mathbf{H}_k\|^2}$, where $\|\tilde{\mathbf{m}}\| = 1$. This completes the proof.

Remark 2. The SIMO scenario is a special case of the MIMO case, where $A_k = \mathbf{m}^H \mathbf{h}_k b_k$ and $B = \mathbf{m}^H \mathbf{n}$. The problem formulated under the SIMO case is the same as (23), except that the channel and transmit coefficients are respectively a vector $\mathbf{h}_k \in \mathbb{C}^{N_t}$ and a scalar $b_k \in \mathbb{C}$.

V. DLR OPTIMIZATION
In this section, we develop two algorithms to solve problems (17) and (23), respectively. For the MISO scenario, problem (17) is convex and a closed-form solution is derived. For the nonconvex problem (23), we decompose the problem into two sub-problems and propose an iterative method that alternately fixes one variable and solves for the other.
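The alternating structure used for the nonconvex case can be sketched generically as follows (a toy instance on a simple quadratic, not the paper's actual sub-problems; all names are illustrative):

```python
def solve_x_given_y(y):
    # minimize (x - y)^2 + (x - 1)^2 over x  =>  x = (y + 1) / 2
    return (y + 1) / 2

def solve_y_given_x(x):
    # minimize (x - y)^2 over y  =>  y = x
    return x

# Alternately fix one variable and solve for the other until convergence
x, y = 0.0, 0.0
for _ in range(60):
    x = solve_x_given_y(y)
    y = solve_y_given_x(x)

# The iterates converge to the joint minimizer x = y = 1
assert abs(x - 1.0) < 1e-9 and abs(y - 1.0) < 1e-9
```

Each step can only decrease the joint objective, which is the same monotonicity argument that underpins the alternating scheme developed below.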
A. MISO
Obviously, problem (17) is convex. For notational simplicity, we let $l_k = \frac{1}{r_k}$ and $c_k = \frac{1}{K\sqrt{P_k}\,\|\mathbf{h}_k\|}$. As such, problem (17) can be written as
$$\min_{l_k} \ \max_k \ (c_k l_k)^2 \qquad (24a)$$
$$\mathrm{s.t.} \quad \sum_{k=1}^{K} l_k = K, \qquad (24b)$$
$$\frac{1}{r_{\max}} \le l_k \le \frac{1}{r_{\min}}, \ \forall k. \qquad (24c)$$
The solution that minimizes $\max_k (c_k l_k)^2$ also minimizes $\max_k c_k l_k$, which indicates that their solutions are identical. Consequently, problem (24) is further equivalent to the following problem:
$$\min_{l_k} \ \max_k \ c_k l_k \quad \mathrm{s.t.} \ (24b), (24c), \qquad (25)$$
which is a typical linear programming problem. Assuming $k = \arg\max_i c_i l_i$, we then have $c_i l_i \le c_k l_k$ for all $i$. Following the equality constraint (24b), we have
$$K = \sum_{i=1}^{K} l_i \le \sum_{i=1}^{K} \frac{c_k l_k}{c_i} = c_k l_k \sum_{i=1}^{K} \frac{1}{c_i}. \qquad (26)$$
Thus,
$$c_k l_k \ge \frac{K}{\sum_{i=1}^{K} \frac{1}{c_i}}. \qquad (27)$$

Theorem 2.
In a MISO system, the MSE is lower bounded by $\mathrm{MSE}_{\mathrm{lb}} = \frac{\sigma^2}{\big(\sum_{i=1}^{K}\sqrt{P_i}\,\|\mathbf{h}_i\|\big)^2}$. Denoting the MSE obtained with and without considering the DLR as $\mathrm{MSE}_{\mathrm{d}}$ and $\mathrm{MSE}_{\mathrm{n}}$, we always have
$$\mathrm{MSE}_{\mathrm{n}} \overset{(a)}{\ge} \mathrm{MSE}_{\mathrm{d}} \overset{(b)}{\ge} \mathrm{MSE}_{\mathrm{lb}}, \qquad (28)$$
where equality $(a)$ holds if and only if $\sqrt{P_i}\,\|\mathbf{h}_i\| = \sqrt{P_j}\,\|\mathbf{h}_j\|, \ \forall i, j \in \mathcal{K}$, and equality $(b)$ holds if and only if $r_i\sqrt{P_i}\,\|\mathbf{h}_i\| = r_j\sqrt{P_j}\,\|\mathbf{h}_j\|, \ \forall i, j \in \mathcal{K}$.

Proof. According to (27), the lower bound of $c_k l_k$ is $\frac{K}{\sum_{i=1}^{K}\frac{1}{c_i}}$. Hence, the MSE is lower bounded by
$$\mathrm{MSE}_{\mathrm{lb}} = \Big(\frac{K}{\sum_{i=1}^{K}\frac{1}{c_i}}\Big)^2 \sigma^2 = \frac{\sigma^2}{\big(\sum_{i=1}^{K}\sqrt{P_i}\,\|\mathbf{h}_i\|\big)^2}. \qquad (29)$$
The lower bound is attained if and only if $c_i l_i = c_j l_j, \ \forall i, j \in \mathcal{K}$, which also means that $r_i\sqrt{P_i}\,\|\mathbf{h}_i\| = r_j\sqrt{P_j}\,\|\mathbf{h}_j\|, \ \forall i, j \in \mathcal{K}$.

Without loss of generality, we assume that the $c_k$ are sorted in descending order, i.e., $c_i \ge c_j, \ \forall i < j$. Thus, the MSE without considering the DLR (i.e., with all $l_k$ equal to $1$) can be readily obtained as
$$\mathrm{MSE}_{\mathrm{n}} = c_1^2 \sigma^2 = \frac{\sigma^2}{K^2 P_1 \|\mathbf{h}_1\|^2}. \qquad (30)$$
When taking the DLR into consideration, we can always find a feasible set of coefficients $[l_1, l_2, \ldots, l_K]$ that guarantees both $c_1 \ge c_1 l_1$ and $c_1 l_1 = \max_i (c_i l_i)$. Consequently, we have
$$\mathrm{MSE}_{\mathrm{d}} = (c_1 l_1)^2 \sigma^2 \le c_1^2 \sigma^2 = \mathrm{MSE}_{\mathrm{n}}, \qquad (31)$$
where equality holds if and only if $c_i = c_j, \ \forall i, j \in \mathcal{K}$, i.e., $\sqrt{P_i}\,\|\mathbf{h}_i\| = \sqrt{P_j}\,\|\mathbf{h}_j\|, \ \forall i, j \in \mathcal{K}$. This completes the proof.

To further minimize the MSE, we should optimize the $l_i$. Based on the above analysis, the optimal solution of $l_i$ under constraint (24c) is given by
$$l_i = \mathrm{clip}\Big(\frac{c_k l_k}{c_i}, \Big[\frac{1}{r_{\max}}, \frac{1}{r_{\min}}\Big]\Big), \qquad (32)$$
where
$$\sum_{i=1}^{K} l_i = K. \qquad (33)$$
The operation $\mathrm{clip}(x, [a, b])$ truncates $x$ to the specified interval $[a, b]$.

The overall procedure to solve problem (24) is shown in Algorithm 1. According to (32), the proposed scheme needs to know the device index $k$ with the maximum value $c_k l_k$. To find this index, we exhaustively search over all devices, so the number of iterations in the outer layer is $K$. For a given device index $k$, the bisection technique is applied to find a solution $l_k$, the complexity of which is $O(\log(1/\delta))$ with accuracy $\delta$.

B. MIMO
Supposing that $c_k = \frac{\|\mathbf{m}\|}{K\sqrt{P_k}\,\|\mathbf{m}^H \mathbf{H}_k\|}$, $l_k = \frac{1}{r_k}$, and $c_k l_k = \max_i (c_i l_i)$, we have the following theorem.

Theorem 3.
In a MIMO system, for any given $\mathbf{m}$, the MSE is lower bounded by $\mathrm{MSE}_{\mathrm{lbm}} = \frac{\sigma^2}{\big(\sum_{i=1}^{K}\sqrt{P_i}\,\|\mathbf{m}^H \mathbf{H}_i\|\big)^2}$, and we always have
$$\mathrm{MSE}_{\mathrm{n}} \overset{(a)}{\ge} \mathrm{MSE}_{\mathrm{d}} \overset{(b)}{\ge} \mathrm{MSE}_{\mathrm{lbm}}, \qquad (34)$$
where the equalities $(a)$ and $(b)$ hold when $\sqrt{P_i}\,\|\mathbf{m}^H \mathbf{H}_i\| = \sqrt{P_j}\,\|\mathbf{m}^H \mathbf{H}_j\|, \ \forall i, j \in \mathcal{K}$, and $r_i\sqrt{P_i}\,\|\mathbf{m}^H \mathbf{H}_i\| = r_j\sqrt{P_j}\,\|\mathbf{m}^H \mathbf{H}_j\|, \ \forall i, j \in \mathcal{K}$, respectively.

Algorithm 1 Optimal Learning Rate for MISO.
Input: $c_k$, $\mathrm{obj} = \max_k(c_k)/r_{\min}$, $\delta$, Num = 20
Output: $l_i^{\mathrm{opt}}$, $\mathrm{obj}$
Initialize: $l_k^{\min} = 1/r_{\max}$, $l_k^{\max} = 1/r_{\min}$
for $k = 1 : K$
  while $|\sum_{i=1}^{K} l_i - K| > \delta$
    $l_k = (l_k^{\max} + l_k^{\min})/2$
    $l_i = \mathrm{clip}\big(\frac{c_k l_k}{c_i}, [1/r_{\max}, 1/r_{\min}]\big), \ \forall i \in \mathcal{K}$
    if $\sum_{i=1}^{K} l_i \ge K$ then $l_k^{\max} = l_k$ else $l_k^{\min} = l_k$
  if $\mathrm{obj} \ge \max_i(c_i l_i)$
    $\mathrm{obj} = \max_i(c_i l_i)$; $l_i^{\mathrm{opt}} = l_i, \ \forall i \in \mathcal{K}$

Proof. According to the equality constraints (22b) and (23b), we arrive at
$$c_k l_k \ge \frac{K}{\sum_{i=1}^{K}\frac{1}{c_i}} = \frac{\|\mathbf{m}\|}{\sum_{i=1}^{K}\sqrt{P_i}\,\|\mathbf{m}^H \mathbf{H}_i\|} = \frac{1}{\sum_{i=1}^{K}\sqrt{P_i}\,\|\mathbf{m}^H \mathbf{H}_i\|}. \qquad (35)$$
Thus, $\mathrm{MSE}_{\mathrm{lbm}} = \frac{\sigma^2}{(\sum_{i=1}^{K}\sqrt{P_i}\,\|\mathbf{m}^H \mathbf{H}_i\|)^2}$ is the lower bound of the MSE for any given feasible $\mathbf{m}$, which is achieved only if $c_i l_i = c_j l_j, \ \forall i, j \in \mathcal{K}$, i.e., $r_i\sqrt{P_i}\,\|\mathbf{m}^H \mathbf{H}_i\| = r_j\sqrt{P_j}\,\|\mathbf{m}^H \mathbf{H}_j\|, \ \forall i, j \in \mathcal{K}$.

We first put aside the DLR and denote the equivalent channel as $\mathbf{h}'_i = \sqrt{P_i}\,\mathbf{m}^H \mathbf{H}_i \in \mathbb{C}^{N_d}$. As such, the problem of MSE minimization becomes finding the $\mathbf{m}$ that maximizes the minimum $\ell_2$-norm of the $\mathbf{h}'_i$. Denoting its optimal solution as $\mathbf{m}^*$, the corresponding minimum $\ell_2$-norm and MSE can be expressed as $h^{\mathrm{norm}}_{\min} = \min_i \big(\sqrt{P_i}\,\|(\mathbf{m}^*)^H \mathbf{H}_i\|\big)$ and $\mathrm{MSE}_{\mathrm{n}} = \frac{\sigma^2}{(K h^{\mathrm{norm}}_{\min})^2}$. Then, we consider the DLR in the following two cases.

Case 1: The lower bound is not attained, i.e., $\mathrm{MSE}_{\mathrm{n}} > \mathrm{MSE}_{\mathrm{lbm}}$. Without loss of generality, assume that the $c_k$ are sorted in ascending order, i.e., $\|\mathbf{h}'_i\| \le \|\mathbf{h}'_j\|, \ \forall i > j$. There always exists a feasible set of DLR coefficients $[r_1, r_2, \ldots, r_K]$ such that $r_K \|\mathbf{h}'_K\| > \|\mathbf{h}'_K\|$ with $r_K \|\mathbf{h}'_K\| = \min_i (r_i \|\mathbf{h}'_i\|)$. Thus, $\mathrm{MSE}_{\mathrm{n}} > \mathrm{MSE}_{\mathrm{d}} \overset{(b)}{\ge} \mathrm{MSE}_{\mathrm{lbm}}$, and the equality $(b)$ holds only when $r_i \|\mathbf{h}'_i\| = r_j \|\mathbf{h}'_j\|, \ \forall i, j \in \mathcal{K}$.

Case 2: The lower bound is achieved, i.e., $\mathrm{MSE}_{\mathrm{n}} = \mathrm{MSE}_{\mathrm{lbm}}$. In this case, we have $\|\mathbf{h}'_i\| = \|\mathbf{h}'_j\|, \ \forall i, j \in \mathcal{K}$ and $\mathrm{MSE}_{\mathrm{n}} = \mathrm{MSE}_{\mathrm{d}} = \mathrm{MSE}_{\mathrm{lbm}}$. The introduction of the DLR cannot further improve the performance, and the DLR in this case is equal to $1$. This completes the proof.

In the MIMO scenario, problem (23) is difficult due to the nonconvex objective (23a) and constraint (23b). By introducing an auxiliary variable $\tau$, problem (23) is further equivalent to the following problem:
$$\min_{\mathbf{m}, r_k, \tau} \ \tau \qquad (36a)$$
$$\mathrm{s.t.} \quad \|\mathbf{m}\|^2 \le \tau K^2 P_k \|\mathbf{m}^H \mathbf{H}_k\|^2 r_k^2, \ \forall k, \qquad (36b)$$
$$(22b), (22c), (23b).$$
To solve problem (36), we utilize an iterative technique and decompose the problem into two sub-problems by alternately fixing the DLR ratios $r_k$ and the receive beamforming vector $\mathbf{m}$, respectively. Given the DLR ratios $r_k$, the original problem (36) reduces to the following sub-problem:
$$\min_{\mathbf{m}, \tau} \ \tau \quad \mathrm{s.t.} \ (23b), (36b), \qquad (37)$$
which is nonconvex due to constraints (23b) and (36b). To solve problem (37), we have the following lemma.

Lemma 1.
Suppose the positive semidefinite matrix $\mathbf{A}_k = \mathbf{H}_k \mathbf{H}_k^H$. If $\det \mathbf{A}_k > 0$, the range of $\frac{\|\mathbf{m}\|^2}{K^2 P_k \|\mathbf{m}^H \mathbf{H}_k\|^2 r_k^2}$ is $\big[\frac{1}{K^2 P_k r_k^2 \lambda_{k,\max}}, \frac{1}{K^2 P_k r_k^2 \lambda_{k,\min}}\big]$; otherwise, it is $\big[\frac{1}{K^2 P_k r_k^2 \lambda_{k,\max}}, \infty\big)$, where $\lambda_{k,\max}$ and $\lambda_{k,\min}$ are the maximum and minimum eigenvalues of $\mathbf{A}_k$, respectively.

Proof. Define the function $g(\mathbf{m}) = \frac{\|\mathbf{m}^H \mathbf{H}\|^2}{\|\mathbf{m}\|^2} = \frac{\mathbf{m}^H \mathbf{H}\mathbf{H}^H \mathbf{m}}{\mathbf{m}^H \mathbf{m}}$, which is the Rayleigh quotient of the matrix $\mathbf{A} = \mathbf{H}\mathbf{H}^H$. Supposing that the maximum and minimum eigenvalues of $\mathbf{A}$ are respectively $\lambda_{\max}$ and $\lambda_{\min}$, the function $g(\mathbf{m})$ lies within the interval $[\lambda_{\min}, \lambda_{\max}]$. Apparently, $\mathbf{A}$ ($\neq \mathbf{0}$) is positive semidefinite and we have $\det \mathbf{A} \ge 0$. In the case of $\det \mathbf{A} > 0$, the eigenvalues of $\mathbf{A}$ are positive. Consequently, we have
$$\frac{1}{\lambda_{\max}} \le \frac{\|\mathbf{m}\|^2}{\|\mathbf{m}^H \mathbf{H}\|^2} \le \frac{1}{\lambda_{\min}}. \qquad (38)$$
Otherwise, $\lambda_{\min} = 0$ and we have
$$\frac{1}{\lambda_{\max}} \le \frac{\|\mathbf{m}\|^2}{\|\mathbf{m}^H \mathbf{H}\|^2} \le \infty. \qquad (39)$$
This completes the proof.

Remark 3. Let $\tau_k^{\mathrm{low}} = \frac{1}{K^2 P_k r_k^2 \lambda_{k,\max}}$, and $\tau_k^{\mathrm{up}} = \frac{1}{K^2 P_k r_k^2 \lambda_{k,\min}}$ if $\lambda_{k,\min} > 0$, otherwise $\tau_k^{\mathrm{up}} = \infty$; then $\frac{\|\mathbf{m}\|^2}{K^2 P_k \|\mathbf{m}^H \mathbf{H}_k\|^2 r_k^2} \in [\tau_k^{\mathrm{low}}, \tau_k^{\mathrm{up}}]$. The necessary condition on $\tau$ that guarantees the feasibility of problem (37) is $\tau \in [\tau^{\mathrm{low}}, \tau^{\mathrm{up}}]$, where
$$\tau^{\mathrm{low}} = \min_k \big(\tau_k^{\mathrm{low}}\big), \qquad \tau^{\mathrm{up}} = \max_k \big(\tau_k^{\mathrm{up}}\big). \qquad (40)$$
For any given $\tau \in [\tau^{\mathrm{low}}, \tau^{\mathrm{up}}]$, sub-problem (37) amounts to finding a receive beamforming vector $\mathbf{m}$ that makes the sub-problem feasible, which is a feasibility-check problem. By introducing $\mathbf{M} = \mathbf{m}\mathbf{m}^H$, the sub-problem for a given $\tau$ is converted to
$$\mathrm{find} \ \mathbf{M} \qquad (41a)$$
$$\mathrm{s.t.} \quad \mathrm{Tr}(\mathbf{M}) \le \tau K^2 P_k \mathrm{Tr}\big(\mathbf{M}\mathbf{H}_k \mathbf{H}_k^H\big) r_k^2, \ \forall k, \qquad (41b)$$
$$\mathbf{M} \succeq \mathbf{0}, \qquad (41c)$$
$$\mathrm{Tr}(\mathbf{M}) = 1, \qquad (41d)$$
$$\mathrm{rank}(\mathbf{M}) = 1. \qquad (41e)$$
Indeed, the only difficulty of the above problem lies in the rank-one constraint (41e). The problem can be solved by first dropping the constraint (41e) to obtain a solution $\mathbf{M}^*$.
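A minimal sketch of recovering a beamformer from such a relaxed solution via its principal eigenvector (here $\mathbf{M}^*$ is built synthetically as an exactly rank-one matrix, so the extraction is exact; in general it is only an approximation):

```python
import numpy as np

rng = np.random.default_rng(3)
Nt = 4
# Hypothetical relaxed solution M* (Hermitian, PSD, unit trace, rank one)
v = rng.standard_normal(Nt) + 1j * rng.standard_normal(Nt)
M = np.outer(v, v.conj())
M /= np.trace(M).real

# Principal-eigenvector approximation: m = leading eigenvector of M*
eigvals, eigvecs = np.linalg.eigh(M)   # eigenvalues in ascending order
m = eigvecs[:, -1]

assert np.isclose(np.linalg.norm(m), 1.0)   # unit-norm beamformer, as in (23b)
assert np.isclose(eigvals[-1], 1.0)         # rank one: all energy in one mode
assert np.allclose(M, eigvals[-1] * np.outer(m, m.conj()))
```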
Then, the receive beamforming vector $\mathbf{m}^*$ can be recovered from $\mathbf{M}^*$ using the eigenvector approximation method or the randomization technique [37], which is sub-optimal, especially when the dimension of $\mathbf{M}$ is large. To guarantee the rank-one constraint, we instead utilize a difference-of-convex (DC) representation [27], [38] and convert problem (41) to

$\min_{\mathbf{M}} \ \mathrm{Tr}(\mathbf{M}) - \|\mathbf{M}\|_2 \quad \mathrm{s.t.} \ \text{(41b), (41c), (41d)},$  (42)

which can be solved efficiently by DC programming with complexity $\mathcal{O}(N_t^3)$.

To find $\tau$, we utilize the bisection technique. Specifically, the interval $[\tau^{low}, \tau^{up}]$ is divided into the two sub-intervals $[\tau^{low}, \tau]$ and $[\tau, \tau^{up}]$ with $\tau = (\tau^{low} + \tau^{up})/2$. If problem (42) is solvable at $\tau$, the solution lies within $[\tau^{low}, \tau]$; otherwise it lies within $[\tau, \tau^{up}]$. The check is repeated until $\tau^{up} - \tau^{low} < \delta$, where $\delta$ is the required accuracy. Such a bisection involves $\mathcal{O}(\log\frac{1}{\delta})$ repetitions and, hence, solving problem (37) requires a computational complexity of $\mathcal{O}(N_t^3 \log\frac{1}{\delta})$. The procedure is summarized in Algorithm 2.

Algorithm 2: Beamforming design for MIMO.
Input: $\mathbf{H}_k$, $r_k$, $\delta$. Output: $\mathbf{m}$, $\tau$.
1. Calculate $\tau^{low}$, $\tau^{up}$ based on (40).
2. While $\tau^{up} - \tau^{low} > \delta$:
3.   Set $\tau = (\tau^{up} + \tau^{low})/2$.
4.   If problem (42) is infeasible at $\tau$: set $\tau^{low} = \tau$;
5.   else: set $\tau^{up} = \tau$.

Given the obtained receive beamforming vector $\mathbf{m}$, the original problem (36) reduces to the sub-problem

$\min_{r_k, \tau} \ \tau \quad \mathrm{s.t.} \ \text{(22b), (22c), (36b)}.$  (43)

Denoting the equivalent channel vector $\mathbf{h}'_k = \mathbf{m}^H \mathbf{H}_k$, sub-problem (43) can be transformed into the MISO problem with channel vector $\mathbf{h}'_k$. Note that the sub-problem in the SIMO case is likewise equivalent to that of the SISO case with the equivalent channel coefficient $h'_k = \mathbf{m}^H \mathbf{h}_k$. We can therefore reuse the closed-form solution derived for the MISO case. Letting $l_k = \frac{1}{r_k}$ and $c_k = \frac{1}{K\sqrt{P_k}\, \|\mathbf{h}'_k\|} = \frac{1}{K\sqrt{P_k}\, \|\mathbf{m}^H \mathbf{H}_k\|}$, the solution is given by $l_i = \mathrm{clip}\left( \frac{c_k l_k}{c_i}, \left[ \frac{1}{r_{\max}}, \frac{1}{r_{\min}} \right] \right)$, where the common scale $c_k l_k$ is chosen such that $\sum_{i=1}^K l_i = K$.

Thus, a sub-optimal solution of problem (36) is obtained by alternately solving sub-problems (42) and (43). The per-iteration computational complexity is $\mathcal{O}\left( N_t^3 \log\frac{1}{\delta_1} + K \log\frac{1}{\delta_2} \right)$, where $\delta_1$ and $\delta_2$ are the accuracies for solving the receive beamforming vector $\mathbf{m}$ and the DLR ratio $r_k$, respectively. The proposed iterative method is summarized in Algorithm 3.

Algorithm 3:
Iterative Learning Rate and Receive Beamforming.
Input: $\mathbf{H}_k$. Output: $\mathbf{m}$, $r_k$, $\tau$. Initialize $r_k = 1$.
Repeat:
1. Given $r_k$, solve problem (37) using Algorithm 2 to obtain $\mathbf{m}$;
2. Given $\mathbf{m}$, solve problem (43) via the closed-form solution to update $r_k$;
until $\tau$ converges.

VI. ASYMPTOTIC ANALYSIS AND RECEIVE BEAMFORMING DESIGN
This section presents theoretical analysis of the MSE and the DLR ratio in the MISO, SIMO, and MIMO scenarios when the numbers of antennas $N_d$ and $N_t$ grow to infinity. Based on the asymptotic analysis, we propose a near-optimal and closed-form receive beamforming solution. Note that each device is assumed to have equal maximum power $P_k = P$.

A. MISO
In the MISO case, we present the asymptotic analysis when the number of antennas $N_d$ at the devices goes to infinity. Since the channels between the devices and the aggregator are assumed to be independently Rayleigh distributed, we have

$\|\mathbf{h}_k\| \to \sqrt{N_d},$  (44)
$c_k = \frac{1}{K\sqrt{P}\, \|\mathbf{h}_k\|} \to \frac{1}{K\sqrt{P}\sqrt{N_d}},$  (45)

which implies $c_i = c_j, \forall i, j \in \mathcal{K}$. As a consequence, the lower bound $\mathrm{MSE}_{lb}$ can be achieved according to Theorem 2, and the MSE and $r_k$ become

$\mathrm{MSE} = \left( \frac{1}{\sqrt{P} \sum_{i=1}^K \|\mathbf{h}_i\|} \right)^2 \sigma^2 \to \frac{\sigma^2}{P K^2 N_d},$  (46)
$r_k = \frac{1}{l_k} = c_k \sqrt{P} \sum_{i=1}^K \|\mathbf{h}_i\| = \frac{\sum_{i=1}^K \|\mathbf{h}_i\|}{K \|\mathbf{h}_k\|} \to 1,$  (47)

where the achieved MSE is inversely proportional to $K^2$ and $N_d$.

B. SIMO
In the SIMO case, let $h'_k = \mathbf{m}^H \mathbf{h}_k$, $c_k = \frac{1}{K\sqrt{P}\, |h'_k|}$, and $l_k = \frac{1}{r_k}$. As the number of antennas $N_t$ increases, the channels between the devices and the BS become asymptotically orthogonal, i.e.,

$\langle \mathbf{h}_i, \mathbf{h}_j \rangle \to \begin{cases} N_t, & i = j, \\ 0, & i \neq j. \end{cases}$  (48)

Exploiting this property, the receive beamforming vector $\mathbf{m}$ can be designed simply as

$\mathbf{m} = \frac{\sum_{k=1}^K \mathbf{h}_k / \|\mathbf{h}_k\|}{\left\| \sum_{k=1}^K \mathbf{h}_k / \|\mathbf{h}_k\| \right\|}.$  (49)

Accordingly, the equivalent channel coefficient $h'_i$ can be rewritten as

$h'_i = \mathbf{m}^H \mathbf{h}_i = \frac{\sum_{k=1}^K \mathbf{h}_k^H / \|\mathbf{h}_k\|}{\left\| \sum_{k=1}^K \mathbf{h}_k / \|\mathbf{h}_k\| \right\|} \mathbf{h}_i \to \sqrt{\frac{N_t}{K}},$  (50)

which implies that $|h'_i| = |h'_j|, \forall i, j \in \mathcal{K}$. Therefore, the equalities in $(a)$ and $(b)$ of (34) are guaranteed, and $\mathrm{MSE}_{lb}$ can be achieved according to Theorem 3. In this case, the MSE and $r_k$ are derived as

$\mathrm{MSE} = \left( \frac{1}{\sqrt{P} \sum_{i=1}^K |h'_i|} \right)^2 \sigma^2 \to \frac{\sigma^2}{P K N_t},$  (51)
$r_k = \frac{1}{l_k} = c_k \sqrt{P} \sum_{i=1}^K |h'_i| = \frac{\sum_{i=1}^K |h'_i|}{K |h'_k|} \to \frac{K\sqrt{N_t/K}}{K\sqrt{N_t/K}} = 1,$  (52)

where the achieved MSE is inversely proportional to $K$ and $N_t$.

C. MIMO
In the MIMO case, both the devices and the aggregator have multiple antennas. This sub-section presents the analysis when $N_d$ and $N_t$ go to infinity, respectively. The equivalent channel vector between device $k$ and the aggregator is denoted as $\mathbf{h}'_k = \mathbf{m}^H \mathbf{H}_k$, where $\mathbf{H}_k \in \mathbb{C}^{N_t \times N_d}$. Since $\mathbf{H}_i \mathbf{H}_i^H$ is positive semidefinite and $\|\mathbf{m}\| = 1$, the norm of $\mathbf{h}'_i$ satisfies

$\|\mathbf{h}'_i\| = \sqrt{\mathbf{m}^H \mathbf{H}_i \mathbf{H}_i^H \mathbf{m}} = \sqrt{\frac{\mathbf{m}^H \mathbf{H}_i \mathbf{H}_i^H \mathbf{m}}{\mathbf{m}^H \mathbf{m}}}.$  (53)

According to the Rayleigh-Ritz property, $\|\mathbf{h}'_i\|$ lies within the range $\left[ \sqrt{\lambda_{i,\min}}, \sqrt{\lambda_{i,\max}} \right]$, where $\lambda_{i,\min}$ and $\lambda_{i,\max}$ are respectively the minimum and maximum eigenvalues of $\mathbf{H}_i \mathbf{H}_i^H$.

First, we analyze the case $N_d \to \infty$ with $N_d > N_t$. In this case, the channels between device $i$ and each antenna at the BS are asymptotically orthogonal. Consequently, we have

$\mathbf{H}_i \mathbf{H}_i^H \to N_d \mathbf{I}_{N_t \times N_t},$  (54)

whose eigenvalues all share the same value $N_d$. Thus, by the Rayleigh-Ritz property, for any beamforming vector $\mathbf{m}$ with $\|\mathbf{m}\| = 1$, we have

$\|\mathbf{h}'_1\| = \|\mathbf{h}'_2\| = \ldots = \|\mathbf{h}'_K\| \to \sqrt{N_d},$  (55)

which implies that the conditions for the equalities $(a)$ and $(b)$ in (34) are guaranteed. Based on Theorem 3, the MSE and $r_k$ are obtained as

$\mathrm{MSE} = \left( \frac{1}{\sqrt{P} \sum_{i=1}^K \|\mathbf{h}'_i\|} \right)^2 \sigma^2 \to \frac{\sigma^2}{P K^2 N_d},$  (56)
$r_k = \frac{1}{l_k} = c_k \sqrt{P} \sum_{i=1}^K \|\mathbf{h}'_i\| = \frac{\sum_{i=1}^K \|\mathbf{h}'_i\|}{K \|\mathbf{h}'_k\|} \to 1.$  (57)

Note that the achieved MSE is inversely proportional to $K^2$ and $N_d$, irrespective of $N_t$, when $N_d \to \infty$ with $N_d > N_t$. The reason is that $\mathbf{H}_i$ has full row rank with equal singular values $\sqrt{N_d}$, so the equivalent channel $\mathbf{h}'_i$ has the same gain $\sqrt{N_d}$ regardless of $N_t$.

Next, we analyze the MSE when $N_t \to \infty$ with $N_t > N_d$. The channels between each antenna of device $i$ and the BS are asymptotically orthogonal with power $N_t$. Thus, the rank of $\mathbf{H}_i$ is $r = \mathrm{rank}(\mathbf{H}_i) = N_d$. Using the singular value decomposition (SVD), we readily have

$\mathbf{H}_i \mathbf{H}_i^H = \mathbf{U} \boldsymbol{\Sigma} \mathbf{U}^H \to \mathbf{U} \begin{bmatrix} N_t \mathbf{I}_{N_d \times N_d} & \mathbf{0} \\ \mathbf{0} & \mathbf{0} \end{bmatrix} \mathbf{U}^H,$  (58)

where the first $N_d$ columns of $\mathbf{U}$ are $\mathbf{U}_r = \left[ \frac{\mathbf{H}_i[:,1]}{\|\mathbf{H}_i[:,1]\|}, \frac{\mathbf{H}_i[:,2]}{\|\mathbf{H}_i[:,2]\|}, \ldots, \frac{\mathbf{H}_i[:,N_d]}{\|\mathbf{H}_i[:,N_d]\|} \right]$. Thus, the minimum and maximum eigenvalues of $\mathbf{H}_i \mathbf{H}_i^H$ are $\lambda_{i,\min} = 0$ and $\lambda_{i,\max} = N_t$, respectively. Besides, the asymptotic orthogonality dictates that the column spaces spanned by $\mathbf{H}_k, \forall k \in \mathcal{K}$, are orthogonal, i.e.,

$\mathrm{span}(\mathbf{H}_i) \perp \mathrm{span}(\mathbf{H}_j), \ \forall i \neq j,$  (59)

which implies that $\mathbf{h}_i \perp \mathbf{h}_j$ for any $\mathbf{h}_i \in \mathrm{span}(\mathbf{H}_i)$ and $\mathbf{h}_j \in \mathrm{span}(\mathbf{H}_j)$. Therefore, the receive beamforming vector can be designed in a simple manner as

$\mathbf{m} = \frac{\tilde{\mathbf{h}}_1 + \tilde{\mathbf{h}}_2 + \ldots + \tilde{\mathbf{h}}_K}{\left\| \tilde{\mathbf{h}}_1 + \tilde{\mathbf{h}}_2 + \ldots + \tilde{\mathbf{h}}_K \right\|},$  (60)

where $\tilde{\mathbf{h}}_i$ is an eigenvector of $\mathbf{H}_i \mathbf{H}_i^H$, which can be any column of $\mathbf{U}_r$. Further, the equivalent channels can be expressed as

$\mathbf{h}'_i = \mathbf{m}^H \mathbf{H}_i = \frac{\left( \tilde{\mathbf{h}}_1 + \ldots + \tilde{\mathbf{h}}_K \right)^H}{\left\| \tilde{\mathbf{h}}_1 + \ldots + \tilde{\mathbf{h}}_K \right\|} \mathbf{H}_i \to \frac{\tilde{\mathbf{h}}_i^H \mathbf{H}_i}{\sqrt{K}} = \frac{\sqrt{N_t}}{\sqrt{K}} \mathbf{I}_e,$  (61)

where $\mathbf{I}_e = [0, \ldots, 0, \underbrace{1}_{e\text{-th entry}}, 0, \ldots, 0]^H \in \mathbb{R}^{N_d}$ if the $e$-th column of $\mathbf{U}_r$ is selected. Hence, we have

$\|\mathbf{h}'_i\| = \left\| \frac{\sqrt{N_t}}{\sqrt{K}} \mathbf{I}_e \right\| \to \sqrt{\frac{N_t}{K}}, \ \forall i.$  (62)

According to Theorem 3, the MSE and $r_k$ can be obtained as

$\mathrm{MSE} = \left( \frac{1}{\sqrt{P} \sum_{i=1}^K \|\mathbf{h}'_i\|} \right)^2 \sigma^2 \to \frac{\sigma^2}{P K N_t},$  (63)
$r_k = \frac{1}{l_k} = c_k \sqrt{P} \sum_{i=1}^K \|\mathbf{h}'_i\| = \frac{\sum_{i=1}^K \|\mathbf{h}'_i\|}{K \|\mathbf{h}'_k\|} \to 1.$  (64)

Note that the achieved MSE is inversely proportional to $K$ and $N_t$, irrespective of $N_d$, when $N_t \to \infty$ with $N_t > N_d$. This is because the projections of the designed $\mathbf{m}$ onto the sub-spaces $\mathrm{span}(\mathbf{H}_i), \forall i \in \mathcal{K}$, have the same power $\frac{\|\mathbf{m}\|^2}{K} = \frac{1}{K}$, since $\mathbf{H}_i$ is a full-column-rank matrix with equal singular values $\sqrt{N_t}$ and the spanned sub-spaces are orthogonal.

D. Observations
We observe the following facts from the above asymptotic analysis:
• A larger number of antennas evidently provides more degrees of freedom to align the signals from the distributed devices. This in turn brings down the performance gain obtained by employing DLR, and the DLR ratio is pushed closer to $1$.
• When the number of antennas increases to infinity, the receive beamformer $\mathbf{m}$ designed by simply summing up the normalized channel vectors can achieve the lower bound $\mathrm{MSE}_{lbm}$, as the channel vectors in this case are asymptotically orthogonal with equal power.
• The MSE is inversely proportional to the number of devices $K$ and the number of antennas $N_t$ in the SIMO case, while it is inversely proportional to $K^2 N_d$ in the MISO case. The reason, plainly, is that the equivalent channel gain in the SIMO case is $\sqrt{N_t/K}$, whereas the channel gain in the MISO case is $\sqrt{N_d}$.
• In the MIMO scenario, two cases are considered: $N_d \to \infty$ with $N_d > N_t$, and $N_t \to \infty$ with $N_t > N_d$. The attained MSE is $\frac{\sigma^2}{P K^2 N_d}$ in the former case, which coincides with the MISO case regardless of $N_t$. Similarly, the latter case is shown to have the same MSE as the SIMO scenario, i.e., $\frac{\sigma^2}{P K N_t}$, independent of $N_d$.

Fig. 2. The aggregate error versus the number of devices $K$. (a) SISO scenario. (b) MISO scenario with $N_d = 4, 8, 16$. (c) SIMO scenario with $N_t = 4, 8, 16$. (d) MIMO scenario with $N_d = 2, N_t = 2$; $N_d = 2, N_t = 4$; $N_d = 4, N_t = 8$. [Figure: MSE/σ² (dB) curves for NDLR/SDR/DC versus DLR; only the caption and legends are recoverable.]
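The second observation, the closed-form beamformer of (49) achieving equal equivalent gains $\sqrt{N_t/K}$ as in (50), can be spot-checked numerically. The sketch below (illustrative sizes; $N_t$ is finite, so tolerances are loose) draws i.i.d. Rayleigh channels and checks near-orthogonality and the equivalent gains:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N_t = 4, 4096  # illustrative sizes only

# i.i.d. CN(0,1) channels: column k is h_k, with E||h_k||^2 = N_t.
H = (rng.standard_normal((N_t, K)) + 1j * rng.standard_normal((N_t, K))) / np.sqrt(2)

# Near-orthogonality, cf. (48): <h_i, h_j>/N_t -> 1 (i = j) and -> 0 (i != j).
G = H.conj().T @ H / N_t
assert np.allclose(np.diag(G).real, 1.0, atol=0.1)
off_diag = G - np.diag(np.diag(G))
assert np.max(np.abs(off_diag)) < 0.1

# Closed-form receive beamformer (49): sum of normalized channels, renormalized.
m = (H / np.linalg.norm(H, axis=0)).sum(axis=1)
m /= np.linalg.norm(m)

# Equivalent gains |m^H h_k| concentrate around sqrt(N_t / K), cf. (50).
gains = np.abs(m.conj() @ H)
assert np.allclose(gains, np.sqrt(N_t / K), rtol=0.1)
```

The same construction with per-device eigenvectors plays the role of (60) in the MIMO limit, where the column spaces of the $\mathbf{H}_k$ become orthogonal.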
Such results can be easily explained by comparing the equivalent channels in the MIMO case with the channels in the MISO and SIMO scenarios.

VII. SIMULATION RESULTS
Simulation results are given in this section to demonstrate the effectiveness of the proposed DLR design and the performance of the proposed near-optimal, closed-form receive beamforming solution when massive antennas are applied. The proposed method is compared with existing approaches without DLR. The baseline is labelled 'NDLR' in the SISO and MISO scenarios; it does not optimize the receive beamforming but only optimizes the transmit coefficients (vectors) $b_k$ ($\mathbf{b}_k$) using the method proposed in [36]. In the SIMO and MIMO scenarios, the SDR and DC methods are compared for obtaining the receive beamforming vector $\mathbf{m}$, labelled 'SDR' and 'DC', respectively. It should be noted that all the devices participate in the update of the global model. We set the maximum transmit power of each device $k$ to $P_k = 0$ dB, and each device experiences independent Rayleigh fading. To show the impact of DLR on the training and inference performance, we use FL to implement classification tasks on the MNIST and CIFAR10 datasets, assuming the data stored at each device has equal size. To ensure that the algorithm has adequate data to support feature extraction and meaningful learning, a minimum number of participating devices $K$ is required. MLP and ResNet18 neural networks are adopted to train on these two datasets for 200 epochs, with a fixed base learning rate $\mu$.

A. Performance on MSE Using DLR
To showcase the effectiveness of the proposed DLR, we conduct simulations under the SISO, MISO, SIMO, and MIMO scenarios with fixed boundaries $r_{\min}$ and $r_{\max}$ on the DLR ratio. Fig. 2 displays $\mathrm{MSE}/\sigma^2$ with respect to the number of devices $K$. It shows that the aggregate error decreases as the number of devices $K$ increases in all four scenarios, owing to the averaging operation over $K$ devices. Specifically, more devices imply a smaller scaling factor $\eta$ and therefore a smaller error, from (8). The aggregate error is further reduced by additionally considering DLR, compared to the methods relying only on wireless resources in [27], [36], which validates Theorem 2 and Theorem 3. It also reveals that the performance gap widens as the number of devices $K$ increases. The reason is that a larger $K$ leads to a larger difference between the maximum and minimum signal power. Compared with the SISO scenario, multiple antennas at the devices and/or the aggregator offer diversity gain to combat fading. Thereby, deploying more antennas results in a smaller aggregate error, while the performance gain obtained by DLR shrinks.

[Figure: MSE/σ² (dB) versus K for SDR, DC, DLR, and the near-optimal solution with N_d = 2, N_t = 4.]
Fig. 3. The optimality of the proposed iterative method.
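The bisection loop at the heart of the iterative method examined here (Algorithm 2) reduces to a generic interval search once a feasibility check is available. In the standalone sketch below, the DC-programming feasibility check of problem (42) is replaced by a toy oracle (`feasible`), so only the search logic is illustrated:

```python
def bisect_tau(feasible, tau_low, tau_high, delta):
    """Bisection of Algorithm 2: shrink [tau_low, tau_high] until its width
    falls below the accuracy delta. `feasible(tau)` stands in for solving
    the feasibility check of problem (42) at the candidate tau."""
    while tau_high - tau_low > delta:
        tau = 0.5 * (tau_low + tau_high)
        if feasible(tau):
            tau_high = tau   # feasible: the optimum lies at or below tau
        else:
            tau_low = tau    # infeasible: the optimum lies above tau
    return tau_high

# Toy oracle: pretend tau is feasible iff tau >= 0.37 (monotone, as in (36b)).
tau_star = bisect_tau(lambda t: t >= 0.37, 0.0, 1.0, 1e-6)
```

Because the constraint (36b) is relaxed by increasing $\tau$, feasibility is monotone in $\tau$, which is what makes the bisection valid; the returned value converges to the smallest feasible $\tau$ within accuracy $\delta$.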
To showcase the optimality of the proposed iterative algorithm, we conduct numerical simulations to obtain a near-optimal solution in the MIMO scenario, as problem (23) is nonconvex and the globally optimal solution is difficult to obtain. The near-optimal solution is obtained by initializing multiple starting points and selecting the one with the minimum MSE. Fig. 3 reveals that the proposed iterative learning-rate and receive-beamforming algorithm achieves performance close to the near-optimal solution. To examine the impact of DLR, we show the channel gain (equivalent channel gain) and the corresponding DLR values in Fig. 4. For clear illustration, the devices are indexed by channel gain in ascending order. As displayed in Fig. 4, the learning rate $\mu_k = r_k \mu$ is smaller for a device $k$ with higher channel gain, while a larger learning rate is used for the local model update at devices with lower channel gain, which can be explained using (9).

Simulations are conducted with different boundaries of the DLR ratio, i.e., $r_{\min}$ and $r_{\max}$, under the four scenarios to illustrate their impact. Note that the case $r_{\min} = r_{\max} = 1$ is equivalent to conventional methods without DLR. Fig. 5 shows that a larger range of the DLR ratio leads to a decreasing trend of the MSE. In Fig. 5(c) and Fig. 5(d), the performance in the cases $1/r_{\max} = 0.4, 1/r_{\min} = 1.6$ and $1/r_{\max} = 0.6, 1/r_{\min} = 1.4$ is close for small $K$. Recall that the DLR is inversely proportional to the equivalent channel gain. Such close performance is explained by the fact that the obtained receive beamforming vector $\mathbf{m}$ can well combat the distortion due to the fading channels when $K$ is small.

B. Performance of Learning Tasks Using DLR
To investigate the impact of DLR on the training and testing performance of FL tasks, we use FL to perform classification tasks on the MNIST and CIFAR10 datasets. Fig. 6 gives the training loss as well as the test accuracy on both datasets. $K = 20$ devices are involved in updating the global model, and the parameter $a$ and the noise power $\sigma^2$ are set to fixed values (in dB). Compared to CIFAR10, which contains ten classes of color pictures, the MNIST dataset, comprising only black-and-white pictures, is known to be much easier to learn. Thus, the accuracy on MNIST rises quickly and approaches nearly 100%, while the accuracy on CIFAR10 is lower. Since the MSE with DLR is smaller than the MSE with a fixed learning rate, its re-transmission probability is also smaller. The DLR-based scheme is shown to achieve slightly higher test accuracy on both datasets than conventional methods using a fixed learning rate, i.e., $r_{\min} = r_{\max} = 1$.

Fig. 4. The equivalent channel power and the corresponding DLR. (a) MISO scenario with $N_d = 8$. (b) MIMO scenario with $N_d = 4$, $N_t = 4$.

Fig. 5. The impact of $r_{\max}$ and $r_{\min}$. (a) SISO scenario. (b) MISO scenario with $N_d = 4$. (c) SIMO scenario with $N_t = 4$. (d) MIMO scenario with $N_d = 2$, $N_t = 2$.

This is due to the fact that an increased learning rate, adapted to the fading channels, can help escape saddle points, a known difficulty in minimizing the loss. In addition, a larger range of DLR may result in a bigger variance of the training loss and the test accuracy, which implies that proper boundaries of the DLR ratio should be chosen.

Further numerical simulations are conducted under the MISO and MIMO scenarios with $K = 4, 12, 20$ devices, where $N_d = 4$ and $N_d = 2, N_t = 4$, respectively. The reported accuracy on both datasets is given in Table I. The assumption of sufficient data suggests that the total data size in the $K = 4$ and $K = 12$ cases is insufficient. As a result, the test accuracy on both datasets is lower when 4 or 12 devices participate in aggregating the global model than with 20 devices. It is worth noting that the reported accuracy using DLR may occasionally be smaller than that with a fixed learning rate due to the variance. Overall, the simulation results in Fig. 6 and Table I demonstrate that the proposed DLR can slightly improve the learning and inference performance compared with the fixed-learning-rate approach.
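A toy sketch of how clipped DLR ratios feed the local updates described above: the helper `dlr_ratios` and its scale-factor bisection are illustrative assumptions (the closed form of Section V fixes the common scale via $\sum_i l_i = K$), and the ratio bounds mirror one of the simulated ranges:

```python
import numpy as np

rng = np.random.default_rng(2)
K, mu = 20, 0.05                      # illustrative device count and base rate
r_min, r_max = 1 / 1.4, 1 / 0.6      # one of the simulated DLR ranges

# Equivalent channel gains ||h'_k|| (illustrative Rayleigh draws).
gains = np.abs(rng.standard_normal(K) + 1j * rng.standard_normal(K)) / np.sqrt(2)

def dlr_ratios(gains, lo, hi, K):
    """l_k = 1/r_k is proportional to the gain ||h'_k|| (up to clipping);
    the common scale nu is found by bisection so that sum(l) = K."""
    f = lambda nu: np.clip(nu * gains, lo, hi).sum() - K
    a, b = lo / gains.max(), hi / gains.min()   # f(a) < 0 < f(b)
    for _ in range(100):
        nu = 0.5 * (a + b)
        a, b = (nu, b) if f(nu) < 0 else (a, nu)
    l = np.clip(nu * gains, lo, hi)
    return 1.0 / l                              # r_k = 1/l_k

r = dlr_ratios(gains, 1 / r_max, 1 / r_min, K)
assert r_min - 1e-9 <= r.min() and r.max() <= r_max + 1e-9
assert abs((1.0 / r).sum() - K) < 1e-3

# Each device then runs its local update with learning rate mu_k = r_k * mu:
mu_k = r * mu
```

Consistent with Fig. 4, devices with lower equivalent channel gain receive a larger ratio $r_k$, hence a larger local learning rate.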
Fig. 6. The training and testing performance with different $r_{\max}$ and $r_{\min}$. (a) Training loss on the MNIST dataset. (b) Accuracy performance on the MNIST dataset. (c) Training loss on the CIFAR10 dataset. (d) Accuracy performance on the CIFAR10 dataset.

TABLE I
REPORTED ACCURACY ON MNIST AND CIFAR10 WITH 200 EPOCHS UNDER MISO AND MIMO SCENARIOS

                                            K = 4             K = 12            K = 20
Scenario              Method             MNIST   CIFAR10   MNIST   CIFAR10   MNIST   CIFAR10
MISO (N_d = 4)        Fixed learning rate 91.12%  53.11%   96.26%  62.68%   97.01%  68.72%
                      DLR                 93.29%  50.27%   96.35%  64.68%   97.17%  72.12%
MIMO (N_d = 2, N_t = 4) Fixed learning rate 93.11%  50.96%   96.71%  63.39%   96.98%  71.29%
                      DLR                 93.90%  54.86%   96.66%  64.94%   96.94%  73.86%
C. Performance of the Proposed Closed-form Receive Beamforming Solution
We now discuss the impact of the number of antennas on the MSE performance and verify the asymptotic analysis and the proposed closed-form receive beamforming design. Different numbers of antennas at the devices and the aggregator under the MISO, SIMO, and MIMO scenarios are considered. For lack of space, we restrict ourselves to the cases $K = 2$ and $K = 4$. The line labeled 'Analysis' is the derived theoretical MSE, and 'Proposed' is the performance of the proposed simple closed-form receive beamforming design.

As shown in Fig. 7, increasing the number of antennas reduces the aggregate error, since a higher beamforming gain is achieved. The performance gap between the proposed DLR and fixed-learning-rate methods shrinks, with $r_k \to 1$, as the number of antennas goes to infinity. More specifically, in the MISO scenario, more antennas at the devices lead to smaller differences in channel gain among the devices; thus the performance gain of DLR over a fixed learning rate becomes smaller. Both 'DLR' and 'NDLR' eventually approach the 'Analysis' performance, which verifies the analysis in the MISO case. In the SIMO and MIMO scenarios, more antennas at the aggregator provide more freedom to align the received signals, which leads to a reduced performance gap between 'DLR' and 'DC'.

Fig. 7 also shows that the MSE obtained by the proposed closed-form receive beamforming design approaches the theoretical bound when massive antennas are applied. Besides, the MSE without DLR, i.e., 'NDLR' in the MISO case and 'DC' in the SIMO and MIMO cases, gets closer to 'Analysis', which implies that the equalities of $(a)$ and $(b)$ in both (28) and (34) are guaranteed. Based on the theoretical analysis in Section VI, when the number of receive (or transmit) antennas goes to infinity, the analyzed MSE is the same for MIMO and MISO (or SIMO). From Fig. 7 we observe that the number of transmit (or receive) antennas affects the changing rate of the MSE curve as the number of receive (or transmit) antennas goes to infinity. Therefore, the asymptotic analysis is verified and the effectiveness of the proposed closed-form receive beamforming design in Section VI is confirmed.

Fig. 7. The impact of antenna number and the performance of the proposed closed-form receive beamforming design. (a) MISO scenario. (b) SIMO scenario. (c) MIMO scenario with $N_t = 2$. (d) MIMO scenario with $N_d = 2$.

VIII. CONCLUSIONS
In this paper, we incorporated the AirComp technique into distributed learning tasks to significantly improve communication efficiency. Minimizing the aggregate error due to fading and noisy channels is critical, as a large error can lead to poor training and inference performance. Different from existing works that mainly optimize the wireless resources to align the received signals, we proposed to exploit optimizable learning rates, introducing DLR to adapt to the fading channels. The problem was formulated under the MISO and MIMO scenarios, and a closed-form solution and an iterative method were proposed for the respective cases. The simulation results validated the effectiveness of the proposed DLR in terms of both the MSE performance and the test accuracy on the MNIST and CIFAR10 datasets. Asymptotic analyses in the MISO, SIMO, and MIMO scenarios were also provided to address massive antenna deployment and to derive theoretical bounds. On this basis, a near-optimal, closed-form receive beamforming design was proposed by simply summing up the normalized channel vectors. The feasibility and effectiveness of the proposal were verified by extensive numerical simulations.

REFERENCES

[1] K. B. Letaief, W. Chen, Y. Shi, J. Zhang, and Y. A. Zhang, "The roadmap to 6G: AI empowered wireless networks," IEEE Commun. Mag., vol. 57, no. 8, pp. 84-90, Aug. 2019.
[2] Y. Huang, S. Liu, C. Zhang, X. You, and H. Wu, "True-data testbed for 5G/B5G intelligent network," Intell. Converged Networks, 2021, in press.
[3] W. Saad, M. Bennis, and M. Chen, "A vision of 6G wireless systems: Applications, trends, technologies, and open research problems," IEEE Network, vol. 34, no. 3, pp. 134-142, May 2020.
[4] C. Zhang, P. Patras, and H. Haddadi, "Deep learning in mobile and wireless networking: A survey," IEEE Commun. Surv. Tutorials, vol. 21, no. 3, pp. 2224-2287, Mar. 2019.
[5] C. Xu, S. Liu, C. Zhang, Y. Huang, Z. Lu, and L. Yang, "Multi-agent reinforcement learning based distributed transmission in collaborative cloud-edge systems," IEEE Trans. Veh. Technol., 2021, in press.
[6] S. Dang, O. Amin, B. Shihada, and M.-S. Alouini, "What should 6G be?" Nat. Electron., vol. 3, no. 1, pp. 20-29, Jan. 2020.
[7] M. Chen, U. Challita, W. Saad, C. Yin, and M. Debbah, "Artificial neural networks-based machine learning for wireless networks: A tutorial," IEEE Commun. Surv. Tutorials, vol. 21, no. 4, pp. 3039-3071, Jul. 2019.
[8] S. Bi, R. Zhang, Z. Ding, and S. Cui, "Wireless communications in the era of big data," IEEE Commun. Mag., vol. 53, no. 10, pp. 190-199, Oct. 2015.
[9] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, "Toward an intelligent edge: Wireless communication meets machine learning," IEEE Commun. Mag., vol. 58, no. 1, pp. 19-25, Jan. 2020.
[10] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, "Wireless network intelligence at the edge," Proc. IEEE, vol. 107, no. 11, pp. 2204-2239, Oct. 2019.
[11] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. Int. Conf. Artif. Intell. Stat. (AISTATS), Fort Lauderdale, FL, USA, Apr. 2017, pp. 1273-1282.
[12] Y. Liu, X. Yuan, Z. Xiong, J. Kang, X. Wang, and D. Niyato, "Federated learning for 6G communications: Challenges, methods, and future directions," China Commun., vol. 17, no. 9, pp. 105-118, Sept. 2020.
[13] Z. Yang, M. Chen, K.-K. Wong, H. V. Poor, and S. Cui, "Federated learning for 6G: Applications, challenges, and opportunities," 2021. [Online]. Available: https://arxiv.org/abs/2101.01338
[14] P. M, S. P. R. M, Q.-V. Pham, K. Dev, P. K. R. Maddikunta, T. R. Gadekallu, and T. Huynh-The, "Fusion of federated learning and industrial internet of things: A survey," 2021. [Online]. Available: https://arxiv.org/abs/2101.00798
[15] Q. Zhou, S. Guo, H. Lu, L. Li, M. Guo, Y. Sun, and K. Wang, "Falcon: Addressing stragglers in heterogeneous parameter server via multiple parallelism," IEEE Trans. Computers, vol. 70, no. 1, pp. 139-155, Feb. 2021.
[16] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, "TernGrad: Ternary gradients to reduce communication in distributed deep learning," in Proc. Int. Conf. Neural Inf. Process. Syst. (NIPS), Long Beach, CA, USA, Dec. 2017, pp. 1508-1518.
[17] D. Alistarh, D. Grubic, J. Z. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Proc. Int. Conf. Neural Inf. Process. Syst. (NIPS), Long Beach, CA, USA, Dec. 2017, pp. 1707-1718.
[18] Y. Lin, S. Han, H. Mao, Y. Wang, and B. Dally, "Deep gradient compression: Reducing the communication bandwidth for distributed training," in Proc. Int. Conf. Learn. Representations (ICLR), Vancouver, BC, Canada, May 2018.
[19] A. Aji and K. Heafield, "Sparse communication for distributed gradient descent," in Proc. Conf. Empirical Methods Natural Language Process. (EMNLP), Copenhagen, Denmark, Sept. 2017, pp. 440-445.
[20] T. Nishio and R. Yonetani, "Client selection for federated learning with heterogeneous resources in mobile edge," in Proc. IEEE Int. Conf. Commun. (ICC), Shanghai, China, May 2019, pp. 1-7.
[21] J. Ren, G. Yu, and G. Ding, "Accelerating DNN training in wireless federated edge learning systems," IEEE J. Sel. Areas Commun., vol. 39, no. 1, pp. 219-232, Jan. 2021.
[22] H. H. Yang, Z. Liu, T. Q. S. Quek, and H. V. Poor, "Scheduling policies for federated learning in wireless networks," IEEE Trans. Commun., vol. 68, no. 1, pp. 317-333, Jan. 2020.
[23] G. Zhu, J. Xu, K. Huang, and S. Cui, "Over-the-air computing for wireless data aggregation in massive IoT," 2020. [Online]. Available: https://arxiv.org/abs/2009.02181
[24] M. Mohammadi Amiri and D. Gündüz, "Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air," IEEE Trans. Signal Process., vol. 68, pp. 2155-2169, Mar. 2020.
[25] M. M. Amiri and D. Gündüz, "Federated learning over wireless fading channels," IEEE Trans. Wireless Commun., vol. 19, no. 5, pp. 3546-3557, May 2020.
[26] G. Zhu, Y. Wang, and K. Huang, "Broadband analog aggregation for low-latency federated edge learning," IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491-506, Jan. 2020.
[27] K. Yang, T. Jiang, Y. Shi, and Z. Ding, "Federated learning via over-the-air computation," IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022-2035, Mar. 2020.
[28] D. Yu, S. H. Park, O. Simeone, and S. S. Shitz, "Optimizing over-the-air computation in IRS-aided C-RAN systems," in Proc. IEEE 21st Int. Workshop Signal Process. Advances Wireless Commun. (SPAWC), Atlanta, GA, USA, May 2020, pp. 1-5.
[29] W. Ni, Y. Liu, Z. Yang, H. Tian, and X. Shen, "Federated learning in multi-RIS aided systems," 2020. [Online]. Available: https://arxiv.org/abs/2010.13333
[30] G. B. Orr and K.-R. Müller, Neural Networks: Tricks of the Trade. Springer, 2003.
[31] C. Darken, J. Chang, J. Moody et al., "Learning rate schedules for faster stochastic gradient search," in Proc. IEEE Workshop Neural Networks Signal Process., Helsingoer, Denmark, Sept. 1992, pp. 3-12.
[32] S. Ruder, "An overview of gradient descent optimization algorithms," 2016. [Online]. Available: https://arxiv.org/abs/1609.04747
[33] L. N. Smith, "Cyclical learning rates for training neural networks," in Proc. IEEE Winter Conf. Appl. Comput. Vision (WACV), Santa Rosa, CA, USA, Mar. 2017, pp. 464-472.
[34] G. Zhu and K. Huang, "MIMO over-the-air computation for high-mobility multimodal sensing," IEEE Internet Things J., vol. 6, no. 4, pp. 6089-6103, Aug. 2019.
[35] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," 2019. [Online]. Available: https://arxiv.org/abs/1909.07972
[36] L. Chen, X. Qin, and G. Wei, "A uniform-forcing transceiver design for over-the-air function computation," IEEE Wireless Commun. Lett., vol. 7, no. 6, pp. 942-945, Dec. 2018.
[37] Z. Luo, W. Ma, A. M. So, Y. Ye, and S. Zhang, "Semidefinite relaxation of quadratic optimization problems," IEEE Signal Process. Mag., vol. 27, no. 3, pp. 20-34, May 2010.
[38] J. Y. Gotoh, A. Takeda, and K. Tono, "DC formulations and algorithms for sparse optimization problems."