Edge Federated Learning Via Unit-Modulus Over-The-Air Computation (Extended Version)
Shuai Wang, Yuncong Hong, Rui Wang, Qi Hao, Yik-Chung Wu, Derrick Wing Kwan Ng
Abstract
Edge federated learning (FL) is an emerging machine learning paradigm that trains a global parametric model from distributed datasets via wireless communications. This paper proposes a unit-modulus over-the-air computation (UM-AirComp) framework to facilitate efficient edge federated learning, which simultaneously uploads local model parameters and updates global model parameters via analog beamforming. The proposed framework avoids sophisticated baseband signal processing, leading to low communication delays and implementation costs. A training loss bound of UM-AirComp is derived, and two low-complexity algorithms, termed penalty alternating minimization (PAM) and accelerated gradient projection (AGP), are proposed to minimize the nonconvex nonsmooth loss bound. Simulation results show that the proposed UM-AirComp framework with the PAM algorithm not only achieves a smaller mean square error of model parameters' estimation, training loss, and testing error, but also requires a significantly shorter runtime than other benchmark schemes. Moreover, the proposed UM-AirComp framework with the AGP algorithm achieves satisfactory performance while reducing the computational complexity by orders of magnitude compared with existing optimization algorithms. Finally, we demonstrate the implementation of UM-AirComp in a vehicle-to-everything autonomous driving simulation platform. It is found that autonomous driving tasks are more sensitive to model parameter errors than other tasks, since their neural networks are more sophisticated and their trained model parameters are sparser.
Index Terms
Analog beamforming, autonomous driving, federated learning, over-the-air computation, unit-modulus.
Shuai Wang is with the Department of Electrical and Electronic Engineering, the Department of Computer Science and Engineering, and the Sifakis Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China (e-mail: [email protected]). Yuncong Hong and Rui Wang are with the Department of Electrical and Electronic Engineering, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China (e-mail: [email protected]; [email protected]). Qi Hao is with the Department of Computer Science and Engineering and the Sifakis Research Institute of Trustworthy Autonomous Systems, Southern University of Science and Technology (SUSTech), Shenzhen 518055, China (e-mail: [email protected]). Yik-Chung Wu is with the Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong (e-mail: [email protected]). Derrick Wing Kwan Ng is with the School of Electrical Engineering and Telecommunications, the University of New South Wales, Australia (email: [email protected]).
I. INTRODUCTION
Deep learning has achieved unprecedented breakthroughs in image classification, speech recognition, and object detection due to its ability to efficiently extract intricate nonlinear features from high-dimensional data [1]. Typically, deep learning techniques are deployed at a cloud center, which collects data from distributed users and trains a centralized model via gradient-based back propagation [2]–[5]. However, since the users need to share their local data with the cloud, this paradigm could lead to potential privacy issues, hindering the adoption of deep learning in applications such as smart cities and financial systems.

To address the privacy issue, federated learning (FL), which trains individual deep learning models at user terminals, has been proposed by Google Research [6], [7]. In the framework of FL, the locally generated data is processed locally and not shared with any third party. To leverage the data from other users, local model parameters are uploaded periodically to a server for model aggregation, and the aggregated global parameters are then broadcast to the users for further local updates. Therefore, FL achieves distributed training while ensuring data privacy [7].
A. Edge Federated Learning and Related Work
FL was originally developed for wire-line connected systems [7]. To achieve ubiquitous intelligence, a promising solution is edge FL, e.g., [8]–[11], [13]–[17], where users are connected to an edge server via wireless links. However, the convergence of edge FL may take a long time due to the limited capacity of wireless channels during the uplink model aggregation step. To reduce the transmission delay, various edge FL designs have been proposed (summarized in Table I), which are mainly categorized into digital modulation [7]–[11] and analog modulation [13]–[17] methods.

For digital modulation in single-antenna systems, data from different users are multiplexed either in the time or the frequency domain. Current works on delay reduction focus on reducing 1) the number of model aggregation iterations [8], 2) the number of users [9], or 3) the number of bits for representing the gradient of back propagation in each iteration [11]. However, since these strategies involve approximation or simplification of the FL procedure, the learning performance would inevitably be degraded. Another way to reduce the transmission delay is to adopt multiple-input multiple-output (MIMO) technology so that data from multiple users are multiplexed concurrently in the spatial domain [12].
TABLE I: A Comparison of Existing and Proposed Schemes.
Modulation | Work       | MIMO    | RF Chain | Alg. Complex. | Commun. Delay | Objective Function | AirComp | FL Task
Digital    | [7]        | ✗       | +        | +             | +++           | N/A                | ✗       | Classification
Digital    | [8], [9]   | ✗       | +        | +             | ++            | Loss Bound         | ✗       | Classification
Digital    | [11]       | ✗       | +        | +             | ++            | Loss Bound         | ✗       | Classification
Digital    | [12]       | Digital | +++      | +++           | ++            | MSE                | ✗       | ✗
Analog     | [14], [15] | ✗       | +        | +             | +             | Heuristic          | ✓       | Classification
Analog     | [16]       | Digital | +++      | +++           | +             | MSE                | ✓       | Classification
Analog     | [17]       | Digital | +++      | ++            | +             | Noise Variance     | ✓       | Classification
Analog     | Ours       | Analog  | +        | +             | +             | Loss Bound         | ✓       | Object Detection

The symbol "+" means low, "++" means moderate, "+++" means high. The symbol "✓" means functionality supported, "✗" means functionality not supported.

On the other hand, the key advantage of analog modulation [13]–[17] over digital modulation arises from the ground-breaking idea of over-the-air computation (AirComp). Specifically, if multiple users upload their local parameters simultaneously, a superimposed signal, which represents a weighted sum of individual model parameters, is observed at the edge server. By performing minimum mean square error (MMSE) detection on the superimposed signal, an estimate of the global parameter vector can be obtained [14], [15]. This significantly saves transmission time, since AirComp in fact exploits inter-user interference in simultaneous user transmission, as opposed to suppressing interference as in digital modulation. However, due to channel fading and noise in wireless systems, AirComp employed in single-antenna systems [14], [15] could result in a large error in the estimation of global model parameters at the edge server, leading to slow convergence of FL iterations. As a remedy, adopting MIMO beamforming [16], [17] could reduce the parameter transmission error by aligning the beams carrying the local parameters' information to the same spatial direction. However, the current transmit and receive beamforming designs in MIMO AirComp systems involve exceedingly high radio frequency (RF) chain costs and high computational complexities [16], [17], preventing their practical implementation. Recently, it was shown in [18] that digital over-the-air computation can be achieved by exploiting the waveform-superposition property. However, this introduces additional quantization errors.
In practice, both digital and analog modulation methods share the same goal, i.e., minimizing the training loss function. However, due to the lack of an explicit form of the training loss function with respect to the wireless designs, most works focus on other related objective functions such as the mean square error (MSE) [16] and the noise variance [17]. Recently, the relationship between the training loss function and the wireless designs was derived in [8]–[10]. Nonetheless, the bounds in [8]–[10] are only applicable if the global model parameters are perfectly broadcast to the users. For practical cases involving errors in the model broadcast phase, new training loss bounds are required to capture the training performance, which remains an open problem.
B. Summary of Challenges and Contributions
In summary, despite the recent exciting development of edge FL techniques, e.g., [7]–[9], [11], [12], [14]–[17], a number of technical challenges remain to be overcome, including

1) Reduction of MIMO implementation costs. Conventional beamforming designs for baseband signals [12], [16], [17] require one dedicated RF chain per antenna element, which leads to high implementation cost and power consumption. To this end, analog beamforming with a proper phase shift network design [19], [20] can help reduce such implementation costs, but it has not yet been studied in edge FL systems.

2) Reduction of beamforming design complexities. Most beamforming algorithms [12], [16], [17] rely on the execution of the interior point method (IPM). Yet, since IPM involves matrix inversions, these algorithms have high computational complexities and require exceedingly long signal processing delays, especially when massive MIMO is applied.

3) Verification of robustness in more complex learning tasks. Existing algorithms in [7]–[9], [11], [12], [14]–[17] are mainly tested on simple image classification tasks (e.g., recognition of handwritten digits). Experiments on more complex and closer-to-reality tasks, such as 3D object detection [21]–[23] in vehicle-to-everything (V2X) autonomous driving systems, are needed to verify the robustness of edge FL.

To fill the research gap, this paper proposes the unit-modulus AirComp (UM-AirComp) framework for edge FL in MIMO communication systems, as shown in Fig. 1. The UM-AirComp framework consists of multiple edge users with local sensing data, one edge server for performing FL, and one communication interface for exchanging model parameters, as shown in Fig. 1a. Specifically, the edge users possess a number of training samples (e.g., camera images or light detection and ranging (LiDAR) point clouds of the environment with object boundaries or category labels) for local model training. The trained model parameters are then uploaded to the server via analog modulation. To reduce the implementation cost of RF chains, the edge server does not process the received model parameters at the baseband. Instead, it applies a phase shift network (either a fully-connected structure shown in Fig. 1b or a partially-connected structure shown in Fig. 1c) in the RF domain to connect receive antennas and transmit antennas for global model updates. Upon receiving the broadcast, all users feed the received signals to an analog demodulator for parameter extraction. The advantages of UM-AirComp and the contributions of this paper are summarized below:

1) The UM-AirComp at the server significantly reduces the required number of RF chains in MIMO FL systems, thereby reducing the hardware and energy costs. The training loss of the UM-AirComp framework is proved to be upper bounded by a monotonically increasing function of the maximum MSE of the model parameters' estimation. Experimental results confirm the high accuracy of the derived bound.

2) Despite the UM-AirComp problem being highly nonconvex, two large-scale optimization algorithms, termed penalty alternating minimization (PAM) and accelerated gradient projection (AGP), are developed for fully-connected UM-AirComp and partially-connected UM-AirComp, respectively. The learning performance of the proposed PAM and AGP is shown to outperform other benchmark schemes. In particular, the AGP algorithm is orders of magnitude faster than state-of-the-art second-order optimization algorithms.

3) We implement the UM-AirComp edge FL scheme for 3D object detection with multi-vehicle point-cloud datasets in the Car Learning to Act (CARLA) simulation platform [24]. To the best of our knowledge, this is the first attempt to demonstrate edge FL in a V2X autonomous driving simulator with a complex deep learning task.
C. Notation
Italic letters, lowercase and uppercase bold letters represent scalars, vectors, and matrices, respectively. Calligraphic letters stand for sets and $|\cdot|$ is the cardinality of a set. The operators $\|\cdot\|$, $(\cdot)^T$, $(\cdot)^H$, $(\cdot)^{-1}$, $\lambda_{\max}(\cdot)$, $\lambda_{\min}(\cdot)$, $\mathrm{Null}(\cdot)$, and $\mathrm{Rank}(\cdot)$ are the $\ell_2$-norm, transpose, Hermitian, inverse, largest eigenvalue, smallest eigenvalue, null space, and rank of a matrix, respectively. The operators $\partial f$ and $\nabla f$ are the partial derivative and the gradient of the function $f$. The function $[x]_+ = \max(x, 0)$; $\mathrm{Re}(x)$ takes the real part of $x$, $\mathrm{Im}(x)$ takes the imaginary part of $x$, $\mathrm{conj}(x)$ takes the conjugate of $x$, and $|x|$ is the modulus of $x$. $\mathbf{I}_N$ denotes the $N \times N$ identity matrix, $\mathbf{1}_N$ represents the $N \times 1$ all-ones vector, and $\mathbf{A} \succeq \mathbf{B}$ represents $\mathbf{A} - \mathbf{B}$ being positive semidefinite. Finally, $j = \sqrt{-1}$, $\mathbb{E}(\cdot)$ denotes the expectation of a random variable, and $O(\cdot)$ is the big-O notation standing for the order of arithmetic operations.

Fig. 1: a) Illustration of an edge FL system, in which edge users exchange local and global parameters with an edge server; b) fully-connected phase shift network structure; c) partially-connected phase shift network structure.

II. EDGE FEDERATED LEARNING WITH AIRCOMP
We consider an FL system shown in Fig. 1, which consists of an edge server equipped with $N$ antennas and $K$ single-antenna mobile users. The dataset and model parameter vector at user $k$ are denoted as $\mathcal{D}_k$ and $\mathbf{x}_k \in \mathbb{R}^{M\times 1}$, respectively. Mathematically, the FL procedure aims to solve the following optimization problem:

$$\min_{\{\mathbf{x}_k\},\,\boldsymbol{\theta}} \; \underbrace{\frac{1}{\sum_{k=1}^K |\mathcal{D}_k|} \sum_{k=1}^K \sum_{d_l \in \mathcal{D}_k} \Theta(d_l, \boldsymbol{\theta})}_{:=\Lambda(\boldsymbol{\theta})} \quad \mathrm{s.t.}\;\; \mathbf{x}_1 = \cdots = \mathbf{x}_K = \boldsymbol{\theta}, \qquad (1)$$

where $\Theta(d_l, \boldsymbol{\theta})$ is the loss function corresponding to a single sample $d_l$ ($1 \le l \le |\mathcal{D}_k|$) in $\mathcal{D}_k$ given parameter vector $\boldsymbol{\theta}$, while $\Lambda(\boldsymbol{\theta})$ denotes the global loss function to be minimized. The training of FL model parameters (i.e., solving (1)) in the considered edge system is naturally a distributed and iterative procedure, where each iteration involves two steps: 1) updating the local parameter vectors $(\mathbf{x}_1,\cdots,\mathbf{x}_K)$ using $\{\mathcal{D}_1,\cdots,\mathcal{D}_K\}$ at users $(1,\cdots,K)$, respectively; and 2) aggregating $(\mathbf{x}_1,\cdots,\mathbf{x}_K)$ in an analog manner at the edge server and notifying the users. The above two steps are further elaborated below.

In the first step, let $\mathbf{x}_k^{[i]}(0) \in \mathbb{R}^{M\times 1}$ be the local parameter vector at user $k$ at the beginning of the $i$-th iteration. To update $\mathbf{x}_k^{[i]}(0)$, user $k$ minimizes the loss function $\frac{1}{|\mathcal{D}_k|}\sum_{d_l\in\mathcal{D}_k}\Theta(d_l,\mathbf{x}_k)$ via gradient descent as

$$\mathbf{x}_k^{[i]}(\tau+1) = \mathbf{x}_k^{[i]}(\tau) - \frac{\varepsilon}{|\mathcal{D}_k|} \sum_{d_l\in\mathcal{D}_k} \nabla_{\mathbf{x}}\Theta(d_l,\mathbf{x})\Big|_{\mathbf{x}=\mathbf{x}_k^{[i]}(\tau)}, \quad k=1,\cdots,K, \qquad (2)$$

where $\varepsilon$ is the step-size and $\tau$ ranges from $0$ to $E-1$ with $E$ being the number of local updates. (If $|\mathcal{D}_k|$ is large, stochastic gradient descent can be adopted to accelerate the training speed.) Then, $\{\mathbf{x}_k^{[i]}(E)\,|\,\forall k\}$ from all users are uploaded to the edge server.

In the second step, uplink model aggregation and downlink broadcast beamforming should be designed. Specifically, at the $i$-th iteration, user $k$ modulates its local parameter vector $\mathbf{x}_k^{[i]}(E)$ into a complex vector $\mathbf{s}_k^{[i]} = \Omega_k^{[i]}(\mathbf{x}_k^{[i]}(E)) \in \mathbb{C}^{S\times 1}$, where $\Omega_k^{[i]}(\cdot)$ denotes the modulation function and $S$ is the vector dimension. The received signal $\mathbf{R}^{[i]} \in \mathbb{C}^{N\times S}$ at the server is

$$\mathbf{R}^{[i]} = \sum_{k=1}^K \mathbf{h}_k^{[i]} (\mathbf{s}_k^{[i]})^T + \mathbf{Z}^{[i]}, \qquad (3)$$

where $\mathbf{h}_k^{[i]} \in \mathbb{C}^{N\times 1}$ is the uplink channel vector from user $k$ to the server and $\mathbf{Z}^{[i]} \in \mathbb{C}^{N\times S}$ is the matrix of additive white Gaussian noise with covariance matrix $\mathbb{E}[\mathrm{vec}(\mathbf{Z}^{[i]})\mathrm{vec}(\mathbf{Z}^{[i]})^H] = \sigma_b^2 \mathbf{I}_{NS}$, where $\sigma_b^2$ is the noise power at the server. Upon receiving the superimposed signal, the server processes $\mathbf{R}^{[i]}$ using a function $\Psi^{[i]}(\cdot)$ and broadcasts $\Psi^{[i]}(\mathbf{R}^{[i]}) \in \mathbb{C}^{N\times S}$ to all the users. As a result, the received signal at user $k$ is

$$(\mathbf{y}_k^{[i]})^T = (\mathbf{g}_k^{[i]})^H \Psi^{[i]}\big(\mathbf{R}^{[i]}\big) + (\mathbf{n}_k^{[i]})^T, \qquad (4)$$

where $\mathbf{g}_k^{[i]} \in \mathbb{C}^{N\times 1}$ is the downlink channel vector from the server to user $k$, and $\mathbf{n}_k^{[i]} \in \mathbb{C}^{S\times 1}$ is the vector of additive white Gaussian noise with covariance matrix $\sigma_k^2 \mathbf{I}_S$, where $\sigma_k^2$ is the noise power at user $k$. Finally, user $k$ applies a demodulation function $\widetilde{\Omega}_k^{[i]}(\cdot)$ to $\mathbf{y}_k^{[i]}$ and sets the local parameter vector for the $(i+1)$-th iteration as $\mathbf{x}_k^{[i+1]}(0) = \widetilde{\Omega}_k^{[i]}(\mathbf{y}_k^{[i]})$. Expanding the expression of $\mathbf{y}_k^{[i]}$ in (4) yields

$$\mathbf{x}_k^{[i+1]}(0) = \widetilde{\Omega}_k^{[i]}\Bigg( (\mathbf{g}_k^{[i]})^H \Psi^{[i]}\Big( \sum_{j=1}^K \mathbf{h}_j^{[i]} \big[\Omega_j^{[i]}(\mathbf{x}_j^{[i]}(E))\big]^T + \mathbf{Z}^{[i]} \Big) + (\mathbf{n}_k^{[i]})^T \Bigg). \qquad (5)$$
Edge FL aims to design $\{\Omega_k^{[i]}, \widetilde{\Omega}_k^{[i]}, \Psi^{[i]}\}$ such that $\mathbb{E}[\|\mathbf{x}_k^{[i+1]}(0) - \boldsymbol{\theta}^{[i]}\|^2]$ is minimized, where

$$\boldsymbol{\theta}^{[i]} = \sum_{k=1}^K \frac{|\mathcal{D}_k|}{\sum_{l=1}^K |\mathcal{D}_l|}\, \mathbf{x}_k^{[i]}(E) \qquad (6)$$

is equivalent to the gradient descent of the objective function of (1). In existing AirComp schemes, e.g., [13]–[17], $\Omega_k^{[i]}$ is implemented using analog modulation, $\Psi^{[i]}$ is implemented using a MIMO transceiver, and $\widetilde{\Omega}_k^{[i]}$ is implemented using digital demodulation. Thus, the superimposed signals of all receive antennas at the edge server are first combined via a vector in the digital baseband and then broadcast to users via another vector in the baseband. Therefore, the required number of RF chains at the edge server should equal the number of antennas. Note that the associated implementation cost could be high if a massive number of antennas is deployed.

In the following section, the UM-AirComp scheme, where the functions $\{\Omega_k^{[i]}, \widetilde{\Omega}_k^{[i]}, \Psi^{[i]}\}$ are almost entirely implemented in the analog domain, is proposed. This helps to reduce both the required implementation costs of RF chains and the power consumption while achieving excellent system performance.
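To make the two-step procedure concrete, the following minimal NumPy sketch simulates FL rounds under an idealized, noise-free aggregation (i.e., assuming the server recovers the weighted average in (6) exactly). The quadratic per-sample loss and the synthetic data are illustrative assumptions, not the paper's setup.

```python
import numpy as np

# Illustrative quadratic loss: Theta(d_l, x) = 0.5 * (x^T a_l - b_l)^2,
# so that grad_x Theta = (x^T a_l - b_l) * a_l. This choice is an assumption.
def local_update(x, A, b, eps, E):
    """E gradient-descent steps (2) on the local loss of one user."""
    for _ in range(E):
        grad = (A @ x - b) @ A / len(b)   # (1/|D_k|) sum_l grad Theta
        x = x - eps * grad
    return x

K, M, E, eps = 4, 8, 1, 0.1
rng = np.random.default_rng(0)
datasets = [(rng.standard_normal((20, M)), rng.standard_normal(20)) for _ in range(K)]
theta = rng.standard_normal(M)            # global parameter vector

for i in range(50):                       # FL iterations
    locals_ = [local_update(theta.copy(), A, b, eps, E) for (A, b) in datasets]
    sizes = np.array([len(b) for (_, b) in datasets])
    # Ideal aggregation (6): data-size-weighted average of local parameters
    theta = sum(a * x for a, x in zip(sizes / sizes.sum(), locals_))
```

The rest of the paper replaces the ideal aggregation step with the noisy analog channel model (3)–(5).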
III. PROPOSED UM-AIRCOMP EDGE FL FRAMEWORK
A. UM-AirComp Framework
The proposed UM-AirComp scheme consists of: 1) analog modulation designs for local model upload at all users; 2) analog beamforming at the edge server, which employs phase shifters (unit-modulus) to generate updated global model parameters from the local ones; and 3) analog demodulation designs at all users for global model reconstruction. The three components are detailed as follows.
1) Structure of analog modulation $\Omega_k^{[i]}$: The deep learning model parameters are transmitted in an analog manner as in [13]–[15]. Specifically, since the model parameters in deep learning are real-valued numbers, in order to reduce the transmission time, every two model parameters are transmitted as one complex number. That is,

$$\mathbf{s}_k^{[i]} = \Omega_k^{[i]}\big(\mathbf{x}_k^{[i]}(E)\big) = \frac{\sqrt{p_k^{[i]}}\exp(j\varphi_k^{[i]})}{\sqrt{\eta^{[i]}}} \Big[ x_{k,1}^{[i]}(E) + j\,x_{k,2}^{[i]}(E),\; \cdots,\; x_{k,M-1}^{[i]}(E) + j\,x_{k,M}^{[i]}(E) \Big]^T, \qquad (7)$$

where $\mathbf{s}_k^{[i]} \in \mathbb{C}^{S\times 1}$ with $S = M/2$, and $x_{k,m}^{[i]}(E)$ is the $m$-th element of $\mathbf{x}_k^{[i]}(E)$. The scaling factor $\eta^{[i]}$ is $\eta^{[i]} = \frac{1}{K}\sum_{k=1}^K \eta_k^{[i]}$ with $\eta_k^{[i]} = \frac{2}{M}\|\mathbf{x}_k^{[i]}(E)\|^2$, such that the average power of $\mathbf{s}_k^{[i]}$ is $\frac{1}{S}\mathbb{E}[\|\mathbf{s}_k^{[i]}\|^2] = p_k^{[i]}$. (Each user first sends the average energy of its model parameters to the server for the computation of $\eta^{[i]}$; the server then sends $\eta^{[i]}$ to all users. In practice, the users would adopt $\eta^{[i-1]}$ from the last iteration, which is a good approximation of the actual $\eta^{[i]}$ in the current iteration.) The transmit power $p_k^{[i]}$ satisfies $p_k^{[i]} \le P$, with $P$ being the maximum transmit power at each user. The transmit phase $\varphi_k^{[i]} \in [0, 2\pi)$ is adopted to align the phase of user $k$ with those of the other users. To facilitate the subsequent derivations, we define the transmit coefficient $t_k^{[i]} = \sqrt{p_k^{[i]}}\exp(j\varphi_k^{[i]})$ for all $(i,k)$; $\{p_k^{[i]}, \varphi_k^{[i]}\}$ can be recovered from $\{t_k^{[i]}\}$.
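A minimal sketch of the mapping in (7), assuming the power/phase coefficient $t_k = \sqrt{p_k}e^{j\varphi_k}$ is given; variable names are illustrative.

```python
import numpy as np

def modulate(x, t, eta):
    """Analog modulation (7): pack M real parameters into S = M/2
    complex symbols, scaled by t / sqrt(eta)."""
    s = x[0::2] + 1j * x[1::2]            # pairs (x_1 + j x_2, ...)
    return (t / np.sqrt(eta)) * s

M = 6
x = np.arange(1.0, M + 1)                 # toy parameter vector
eta = 2.0 / M * np.linalg.norm(x) ** 2    # per-user scaling eta_k
t = np.sqrt(0.01) * np.exp(1j * 0.3)      # t_k = sqrt(p_k) e^{j phi_k}
s = modulate(x, t, eta)
print(np.mean(np.abs(s) ** 2))            # ~ p_k = 0.01 when eta ~ eta_k
```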
2) Structure of unit-modulus beamforming $\Psi^{[i]}$: The computation function $\Psi^{[i]}$ adopted at the edge server weighs the amplitude and phase of the signal $\mathbf{R}^{[i]}$. We propose to aggregate and forward the signal via

$$\Psi^{[i]}(\mathbf{R}^{[i]}) = \sqrt{\gamma}\, \mathbf{F}^{[i]} \mathbf{R}^{[i]}, \qquad (8)$$

where $\gamma > 0$ is the power scaling factor at the edge server and $\mathbf{F}^{[i]} \in \mathbb{C}^{N\times N}$ is the phase shift matrix. The implementation of $\mathbf{F}^{[i]}$ can be either fully-connected or partially-connected [19], [20]. For the former structure, each receive antenna element is connected to all transmit antenna elements via a network of phase shifters, as shown in Fig. 1b. This results in unit-modulus constraints on all elements of the analog beamforming matrices $\{\mathbf{F}^{[i]}\}$, i.e.,

$$|F_{l,l'}^{[i]}| = 1, \quad \forall l, l', \qquad (9)$$

and the required number of phase shifters is $N^2$. For the latter structure, the receive antenna elements are combined via a phase shift vector and then connected to the transmit antenna elements via another phase shift vector, as shown in Fig. 1c. Hence, apart from the unit-modulus constraint (9), $\mathbf{F}^{[i]}$ should also satisfy $\mathrm{Rank}(\mathbf{F}^{[i]}) = 1$. Note that the required number of phase shifters is then only $2N$.
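The two phase-shift structures can be sketched as follows (a toy construction; the random phases are placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4

# Fully-connected: N^2 independent phase shifters, |F[l, l']| = 1.
F_full = np.exp(1j * rng.uniform(0, 2 * np.pi, (N, N)))

# Partially-connected: one receive and one transmit phase shift vector,
# F = v w^H is rank-one with unit-modulus entries; only 2N shifters.
v = np.exp(1j * rng.uniform(0, 2 * np.pi, N))
w = np.exp(1j * rng.uniform(0, 2 * np.pi, N))
F_part = np.outer(v, w.conj())

assert np.linalg.matrix_rank(F_part) == 1
assert np.allclose(np.abs(F_part), 1.0)
```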
3) Structure of analog demodulation $\widetilde{\Omega}_k^{[i]}$: The demodulation function $\widetilde{\Omega}_k^{[i]}$ maps the received signal $\mathbf{y}_k^{[i]}$ to $\mathbf{x}_k^{[i+1]}(0)$. Since $\widetilde{\Omega}_k^{[i]}$ is the reverse operation of $\Omega_k^{[i]}$, the proposed structure for $\widetilde{\Omega}_k^{[i]}$ is

$$\mathbf{x}_k^{[i+1]}(0) = \widetilde{\Omega}_k^{[i]}(\mathbf{y}_k^{[i]}) = \sqrt{\eta^{[i]}} \Big[ \mathrm{Re}(r_k^{[i]} y_{k,1}^{[i]}),\; \mathrm{Im}(r_k^{[i]} y_{k,1}^{[i]}),\; \cdots,\; \mathrm{Re}(r_k^{[i]} y_{k,S}^{[i]}),\; \mathrm{Im}(r_k^{[i]} y_{k,S}^{[i]}) \Big]^T, \qquad (10)$$

where $r_k^{[i]} \in \mathbb{C}$ is a normalization coefficient applied to $\mathbf{y}_k^{[i]}$ and $y_{k,l}^{[i]}$ is the $l$-th element of $\mathbf{y}_k^{[i]}$.
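Putting (3), (7), (8), and (10) together, the sketch below simulates one noisy UM-AirComp round and compares the demodulated parameters with the target average in (6); all values (channels, $t_k$, $r_k$, $\mathbf{F}$) are illustrative and unoptimized.

```python
import numpy as np

rng = np.random.default_rng(2)
K, N, M, gamma = 3, 4, 6, 1.0
S = M // 2

X = rng.standard_normal((K, M))                    # local parameters x_k(E)
eta = np.mean([2 / M * np.linalg.norm(x) ** 2 for x in X])
t = np.sqrt(0.01) * np.exp(1j * rng.uniform(0, 2 * np.pi, K))
H = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))  # uplink h_k
G = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))  # downlink g_k
F = np.exp(1j * rng.uniform(0, 2 * np.pi, (N, N)))  # unit-modulus beamformer

Ssig = np.stack([(t[k] / np.sqrt(eta)) * (X[k, 0::2] + 1j * X[k, 1::2])
                 for k in range(K)])                # modulation (7)
R = H @ Ssig + 0.01 * (rng.standard_normal((N, S))
                       + 1j * rng.standard_normal((N, S)))   # uplink (3)
Y = G.conj().T @ (np.sqrt(gamma) * F @ R)          # broadcast (8) + downlink (4)

r = np.ones(K, dtype=complex) * 1e-3               # toy normalization r_k
theta = np.average(X, axis=0)                      # target (6), equal weights
for k in range(K):
    yk = r[k] * Y[k]
    xk = np.sqrt(eta) * np.column_stack([yk.real, yk.imag]).ravel()  # (10)
    print(k, np.linalg.norm(xk - theta) ** 2)      # per-user parameter error
```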
Remark 1 (Duplex Mode): The proposed UM-AirComp framework can be applied to time division duplex (TDD) systems, where the uplink is separated from the downlink in time by the allocation of different time slots on the same frequency band. For the fully-connected structure, such a time allocation can be implemented via the direct RF sampling technique [25], as shown in Fig. 1b, which achieves significantly lower costs than baseband sampling and computation. For the partially-connected structure, such a time allocation can be implemented at the baseband with only one RF chain, since the phase-shifted signals are combined into one stream in the RF domain before down-conversion, as shown in Fig. 1c.

B. MSE and Training Loss Analysis
Due to the noise and interference in the analog transmission of model parameters, the training loss $\Lambda(\mathbf{x}_k^{[i+1]}(0))$ under the UM-AirComp scheme would be greater than $\Lambda(\boldsymbol{\theta}^*)$ in wired FL systems, where $\boldsymbol{\theta}^*$ denotes the optimal solution of $\boldsymbol{\theta}$ to (1). Hence, we can use the expectation of their difference, namely $\mathbb{E}[\Lambda(\mathbf{x}_k^{[i+1]}(0)) - \Lambda(\boldsymbol{\theta}^*)]$, as a metric to capture the degradation of the training loss, which depends on the MSE of the model parameters' estimation. Before establishing the relation between the training loss and the MSE, we first derive the expression of the MSE and introduce the assumption imposed on the loss function. Based on (6) and (10), the MSE between the received local parameter $\mathbf{x}_k^{[i+1]}(0)$ and the target parameter $\boldsymbol{\theta}^{[i]}$ at the $(i+1)$-th FL iteration is

$$\mathrm{MSE}_k^{[i]}\big(\mathbf{F}^{[i]}, t_k^{[i]}, r_k^{[i]}\big) = \mathbb{E}\Big[\big\|\mathbf{x}_k^{[i+1]}(0) - \boldsymbol{\theta}^{[i]}\big\|^2\Big] = 2\eta^{[i]} S \Bigg[ \gamma \sum_{j=1}^K \Big| r_k^{[i]} (\mathbf{g}_k^{[i]})^H \mathbf{F}^{[i]} \mathbf{h}_j^{[i]} t_j^{[i]} - \alpha_j \Big|^2 + \gamma \sigma_b^2 \big\| r_k^{[i]} (\mathbf{g}_k^{[i]})^H \mathbf{F}^{[i]} \big\|^2 + \sigma_k^2 |r_k^{[i]}|^2 \Bigg], \qquad (11)$$

where $\alpha_k = |\mathcal{D}_k| / \sum_{l=1}^K |\mathcal{D}_l|$ and the second equality is due to (7), (10), and the independence among $\{\mathbf{s}_k\,|\,\forall k\}$.

Assumption 1. (i) The function $\Lambda(\mathbf{x})$ is $\mu$-strongly convex. (ii) The function $\sum_{d_l\in\mathcal{D}_k} \Theta(d_l,\mathbf{x})$ is twice differentiable and satisfies $\sum_{d_l\in\mathcal{D}_k} \nabla^2_{\mathbf{x}} \Theta(d_l,\mathbf{x}) \preceq L\,\mathbf{I}$.

Under Assumption 1, the relationship between $\Lambda(\mathbf{x}_k^{[i+1]}(0))$ and $\Lambda(\boldsymbol{\theta}^*)$ is summarized in the following theorem.
Theorem 1. With $(\varepsilon, E) = \big(\frac{\sum_{k=1}^K |\mathcal{D}_k|}{KL}, 1\big)$, the UM-AirComp scheme satisfies

$$\mathbb{E}\Big[\Lambda(\mathbf{x}_k^{[i+1]}(0)) - \Lambda(\boldsymbol{\theta}^*)\Big] \le \sum_{i'=0}^{i} A^{[i']} \max_{k=1,\cdots,K} \mathrm{MSE}_k^{[i']}, \qquad (12)$$

for any $\{\mathbf{F}^{[i']}, r_k^{[i']}, t_k^{[i']}\}_{i'=1}^{i}$ as $i \to +\infty$, where $A^{[i']} = \frac{KL(K-1)}{\sum_{k=1}^K |\mathcal{D}_k|}\Big(1 - \frac{\mu \sum_{k=1}^K |\mathcal{D}_k|}{KL}\Big)^{i-i'}$.

Proof. See Appendix A.

Theorem 1 shows a diminishing $A^{[i']} \to 0$ for large $i - i'$, meaning that the impact of earlier FL iterations vanishes as the edge FL continues. On the other hand, if $\mathrm{MSE}_k^{[i']} \to 0$ for all $k$, then $\Lambda(\mathbf{x}_k^{[i+1]})$ is an unbiased estimate of $\Lambda(\boldsymbol{\theta}^*)$. This demonstrates the effectiveness of UM-AirComp in the asymptotic region. The convexity and smoothness in Assumption 1 have been adopted in most loss bound analyses of FL (e.g., [26], [40]). Although Assumption 1 may seem restrictive for realistic applications, the analysis under it provides important insights into the behavior of UM-AirComp in nonconvex cases. In particular, a nonconvex function can be locally convex within the neighborhood of the optimal solution $\boldsymbol{\theta}^*$ of (1). Therefore, if the learning model is pre-trained such that $\mathbf{x}_k^{[0]}$ is in the neighborhood of $\boldsymbol{\theta}^*$, the analysis based on Assumption 1 is also valid for nonconvex loss functions. On the other hand, if the pre-trained model parameters are not in the neighborhood of $\boldsymbol{\theta}^*$, the result can be treated as an approximation of the upper bound with $L \propto \varepsilon^{-1}$ and $\mu \propto (\sum_{k=1}^K |\mathcal{D}_k|)^{-1}$.
Fig. 2: Comparison between the actual training loss and the upper bound of Theorem 1 for Examples 1 and 2, when $K = 10$ and $|\mathcal{D}_k| = 100$.
To verify the correctness of Theorem 1, the case of $K = 10$ and $|\mathcal{D}_k| = 100$ is simulated. Since the result in Theorem 1 is valid for any design of $\{\mathbf{F}^{[i']}, r_k^{[i']}, t_k^{[i']}\}$, for the purpose of demonstration we fix $\mathrm{MSE}_k^{[i']}$ at the same sequence of values at all users for every iteration $i'$. Then, we consider the following examples:

• Example 1: Linear regression. The dataset $\mathcal{D}_k$ consists of input-output pairs $d_l = (\mathbf{d}_l^{\mathrm{in}}, d_l^{\mathrm{out}})$ to be fitted, where the input vector $\mathbf{d}_l^{\mathrm{in}} \in \mathbb{R}^{M\times 1}$ is generated from a zero-mean Gaussian distribution and the output value $d_l^{\mathrm{out}} \in \mathbb{R}$ is generated from a Gaussian distribution with mean $\mathbf{1}^T \mathbf{d}_l^{\mathrm{in}}$. The loss function is $\Theta(d_l, \mathbf{x}_k) = (\mathbf{x}_k^T \mathbf{d}_l^{\mathrm{in}} - d_l^{\mathrm{out}})^2/2$. The parameter $L$ can be computed as the largest eigenvalue of the matrix $\sum_{d_l\in\mathcal{D}_k} \mathbf{d}_l^{\mathrm{in}} (\mathbf{d}_l^{\mathrm{in}})^H$ over all $k$, and $\mu$ can be computed as the smallest eigenvalue of the matrix $(\sum_{k=1}^K |\mathcal{D}_k|)^{-1} \sum_{d_l \in \{\mathcal{D}_1,\cdots,\mathcal{D}_K\}} \mathbf{d}_l^{\mathrm{in}} (\mathbf{d}_l^{\mathrm{in}})^H$. The training step-size is set to $\varepsilon = 1/L$.

• Example 2: Image classification via a convolutional neural network (CNN). The mixed national institute of standards and technology (MNIST) dataset is used, where $\mathbf{d}_l^{\mathrm{in}} \in \mathbb{R}^{784\times 1}$ is a gray-scale image vector and $\mathbf{d}_l^{\mathrm{out}} \in \mathbb{R}^{10\times 1}$ is a label vector containing only one non-zero element. The CNN consists of two convolution layers, two max pooling layers, and a fully-connected layer. Denoting $f_{\mathrm{cnn}}(\mathbf{x}_k, \mathbf{d}_l^{\mathrm{in}})$ as the softmax output of the CNN given parameter vector $\mathbf{x}_k$ and input vector $\mathbf{d}_l^{\mathrm{in}}$, the loss function is $\Theta(d_l, \mathbf{x}_k) = \|f_{\mathrm{cnn}}(\mathbf{x}_k, \mathbf{d}_l^{\mathrm{in}}) - \mathbf{d}_l^{\mathrm{out}}\|^2/2$. Since this loss is nonconvex, the parameters $L$ and $\mu$ are set proportional to $\varepsilon^{-1}/K$ and $(\sum_{k=1}^K |\mathcal{D}_k|)^{-1}$, respectively.

It can be seen from Fig. 2 that the training loss after the FL iterations is indeed upper bounded by the expression derived in Theorem 1. Moreover, the bound is tight for Example 1 due to its convex loss function, while it is slightly looser for Example 2 due to its nonconvex loss function. Note that no matter which task is considered, the upper bound derived in Theorem 1 matches the trend of the actual loss very well.
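For Example 1, $L$ and $\mu$ follow directly from eigenvalues of the data Gram matrices, as the following sketch illustrates (synthetic data; the dimensions are placeholders):

```python
import numpy as np

rng = np.random.default_rng(3)
K, M, n = 10, 5, 100
D = [rng.standard_normal((n, M)) for _ in range(K)]   # inputs d_l^in per user

# L: largest eigenvalue of sum_l d_l d_l^H, maximized over users k
L = max(np.linalg.eigvalsh(Dk.T @ Dk)[-1] for Dk in D)

# mu: smallest eigenvalue of the globally averaged Gram matrix
gram = sum(Dk.T @ Dk for Dk in D) / (K * n)
mu = np.linalg.eigvalsh(gram)[0]
eps = 1.0 / L                                         # step-size of Example 1
print(L, mu, eps)
```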
C. Problem Formulation
Ideally, the optimization of $\{\mathbf{F}^{[i]}, r_k^{[i]}, t_k^{[i]}\}$ should be performed to minimize the training error, i.e., $\min \Lambda(\mathbf{x}_k^{[i+1]}(0))$. However, the analytical expression of $\mathbb{E}[\Lambda(\mathbf{x}_k^{[i+1]}(0))]$, where the expectation is taken over receiver noises and model parameters, is usually challenging to derive. As a compromise, we aim to minimize its upper bound $\Lambda(\boldsymbol{\theta}^*) + \sum_{i'=0}^i A^{[i']} \max_k \mathrm{MSE}_k^{[i']}$ obtained from Theorem 1. Thus, the minimization of the training loss in edge FL is formulated as

$$\min_{\{\mathbf{F}^{[i']}, r_k^{[i']}, t_k^{[i']}\}} \; \sum_{i'=0}^{i} A^{[i']} \max_{k=1,\cdots,K} \mathrm{MSE}_k^{[i']}\big(\mathbf{F}^{[i']}, r_k^{[i']}, t_k^{[i']}\big) \qquad (13a)$$
$$\mathrm{s.t.}\;\; \mathbf{F}^{[i']} \in \mathcal{F}, \quad i' = 0,\cdots,i, \qquad (13b)$$
$$\qquad |t_k^{[i']}|^2 \le P, \quad k = 1,\cdots,K, \;\; i' = 0,\cdots,i. \qquad (13c)$$

The constraint (13b) is the beamforming constraint at the server, with the feasible set

$$\mathcal{F} = \begin{cases} \{\mathbf{F}: |F_{l,l'}| = 1, \;\forall l, l'\}, & \text{fully connected}, \\ \{\mathbf{F}: \mathrm{Rank}(\mathbf{F}) = 1, \; |F_{l,l'}| = 1, \;\forall l, l'\}, & \text{partially connected}. \end{cases} \qquad (14)$$

The constraint (13c) is the power constraint at the users obtained from $|t_k|^2 = p_k \le P$. It can be seen that the above problem and constraints decouple across iterations, and the minimization at the $i$-th FL iteration, $\forall i$, is given by

$$\mathcal{P}: \min_{\mathbf{F}, \{r_k, t_k\}} \; \max_{k=1,\cdots,K} \underbrace{\Bigg[ \gamma \sum_{j=1}^K \Big| r_k \mathbf{g}_k^H \mathbf{F} \mathbf{h}_j t_j - \alpha_j \Big|^2 + \gamma \sigma_b^2 \big\| r_k \mathbf{g}_k^H \mathbf{F} \big\|^2 + \sigma_k^2 |r_k|^2 \Bigg]}_{\mathrm{MSE}_k / (2\eta S)} \qquad (15a)$$
$$\mathrm{s.t.}\;\; \mathbf{F} \in \mathcal{F}, \qquad (15b)$$
$$\qquad |t_k|^2 \le P, \quad k = 1,\cdots,K, \qquad (15c)$$

where the FL iteration index $i'$ and the weight $A^{[i']}$ are omitted. This is because $A^{[i']}$ is a constant given the tuple of system parameters $(K, \sum_k |\mathcal{D}_k|, \mu, L)$ in a particular FL iteration. The term $\mathrm{MSE}_k/(2\eta S)$ is the normalized MSE with respect to the power of each parameter. It can be seen that the key to minimizing the training loss upper bound of UM-AirComp is to minimize the maximum MSE instead of the average MSE.

Remark 2 (Challenges of Solving $\mathcal{P}$): Problem $\mathcal{P}$ is NP-hard due to the unit-modulus constraints [19], [20]. In addition, the coupling among the variables $\{r_k\}$, $\{t_k\}$, and $\mathbf{F}$ makes the problem nonlinear and nonconvex. Furthermore, the large dimensions of $\mathbf{F}$ and $\{r_k, t_k\}$ call for the design of low-complexity algorithms in scenarios with massive numbers of antennas and users.
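The normalized per-user MSE in (15a) is cheap to evaluate for any candidate $(\mathbf{F}, \{r_k\}, \{t_k\})$. A sketch follows (illustrative names, matching the earlier sketches); the max over $k$ of this vector is the merit function that the algorithms below minimize.

```python
import numpy as np

def normalized_mse(F, r, t, G, H, alpha, gamma, sig_b2, sig_k2):
    """Objective (15a): per-user normalized MSE; returns a length-K vector."""
    K = len(t)
    mse = np.empty(K)
    for k in range(K):
        gF = G[:, k].conj() @ F                      # row vector g_k^H F
        sig_err = sum(abs(r[k] * gF @ H[:, j] * t[j] - alpha[j]) ** 2
                      for j in range(K))             # misalignment term
        mse[k] = (gamma * sig_err
                  + gamma * sig_b2 * np.linalg.norm(r[k] * gF) ** 2
                  + sig_k2[k] * abs(r[k]) ** 2)      # forwarded + local noise
    return mse
```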
IV. PENALTY ALTERNATING MINIMIZATION FOR FULLY-CONNECTED UM-AIRCOMP
In this section, the UM-AirComp with the fully-connected structure is studied, where the feasible set $\mathcal{F}$ is given in the first line of (14). A PAM algorithm with two layers of iterations, i.e., an outer-layer iteration and an inner-layer iteration, is proposed to optimize the system performance. Below, we first introduce the outer-layer iteration.

A. Outer-Layer Iteration
To resolve the coupling among the variables $\{r_k\}$, $\{t_k\}$, and $\mathbf{F}$, this paper adopts an alternating optimization framework [28], which optimizes one design variable at a time with the others fixed. Starting from an initial solution $\{\mathbf{F}^{(0)}, r_k^{(0)}, t_k^{(0)}\}$, the procedure for solving problem $\mathcal{P}$ at the $(n+1)$-th outer iteration, $\forall n$, is

$$\mathbf{F}^{(n+1)} = \arg\min_{\mathbf{F}} \; \max_{k=1,\cdots,K} \Bigg( \sum_{j=1}^K \Big| r_k^{(n)} \mathbf{g}_k^H \mathbf{F} \mathbf{h}_j t_j^{(n)} - \alpha_j \Big|^2 + \sigma_b^2 \big\| r_k^{(n)} \mathbf{g}_k^H \mathbf{F} \big\|^2 \Bigg) \;\; \mathrm{s.t.}\;\; |F_{l,l'}| = 1, \; \forall l, l', \qquad (16a)$$

$$\{r_k^{(n+1)}\} = \arg\min_{\{r_k\}} \; \max_{k=1,\cdots,K} \Bigg( \gamma \sum_{j=1}^K \Big| r_k \mathbf{g}_k^H \mathbf{F}^{(n+1)} \mathbf{h}_j t_j^{(n)} - \alpha_j \Big|^2 + \gamma \sigma_b^2 \big\| r_k \mathbf{g}_k^H \mathbf{F}^{(n+1)} \big\|^2 + \sigma_k^2 |r_k|^2 \Bigg), \qquad (16b)$$

$$\{t_k^{(n+1)}\} = \arg\min_{\{t_k\}} \; \max_{k=1,\cdots,K} \sum_{j=1}^K \Big| r_k^{(n+1)} \mathbf{g}_k^H \mathbf{F}^{(n+1)} \mathbf{h}_j t_j - \alpha_j \Big|^2 \;\; \mathrm{s.t.}\;\; |t_k|^2 \le P, \; k = 1,\cdots,K, \qquad (16c)$$

where $\{\mathbf{F}^{(n)}, t_k^{(n)}, r_k^{(n)}\}$ is the solution at the $n$-th outer iteration. The iterative procedure stops when $n$ reaches the maximum iteration number $n = N_{\max}$.

Problem (16a) can be transformed into a convex problem via semidefinite relaxation (SDR), while problems (16b) and (16c) are convex. Hence, problems (16a)–(16c) can all be solved via CVX, a Matlab software package for solving convex problems based on the interior point method (IPM). According to [30], the resulting computational complexity grows as a high-order polynomial in $N$ for solving (16a) (the vectorization of $\mathbf{F}$ involves $N^2$ variables) and as a high-order polynomial in $K$ for solving (16b)–(16c). For large $N$ and $K$, this method is not desirable. In the following, a new algorithm termed PAM, which decomposes (16a)–(16c) into smaller subproblems that are solved either by gradient updates or by closed-form updates, is proposed to achieve both excellent performance and significantly lower computational complexity.

B. Inner-Layer Iteration

1) Optimization of $\mathbf{F}$: Since $\mathbf{F}$ is a matrix, its vectorization is $\mathbf{f} = \mathrm{vec}(\mathbf{F}) \in \mathbb{C}^{N^2\times 1}$. Applying $\mathrm{Tr}(\mathbf{A}\mathbf{X}\mathbf{B}\mathbf{X}^H) = \mathrm{vec}(\mathbf{X})^H (\mathbf{B}^T \otimes \mathbf{A})\, \mathrm{vec}(\mathbf{X})$ [27], we have

$$r_k^{(n)} \mathbf{g}_k^H \mathbf{F} \mathbf{h}_j t_j^{(n)} = r_k^{(n)} t_j^{(n)} \big( \mathbf{h}_j^T \otimes \mathbf{g}_k^H \big)\, \mathrm{vec}(\mathbf{F}) = (\mathbf{a}_{k,j}^{(n)})^H \mathbf{f}, \qquad (17)$$

$$\sigma_b^2 \big\| r_k^{(n)} \mathbf{g}_k^H \mathbf{F} \big\|^2 = \sigma_b^2 |r_k^{(n)}|^2\, \mathrm{Tr}\big( \mathbf{g}_k \mathbf{g}_k^H \mathbf{F} \mathbf{I}_N \mathbf{F}^H \big) = \mathbf{f}^H \mathbf{G}_k^{(n)} \mathbf{f}, \qquad (18)$$

where $\mathbf{G}_k^{(n)} = \sigma_b^2 |r_k^{(n)}|^2\, \mathbf{I}_N \otimes (\mathbf{g}_k \mathbf{g}_k^H)$ and $\mathbf{a}_{k,j}^{(n)} = \big[ r_k^{(n)} t_j^{(n)} (\mathbf{h}_j^T \otimes \mathbf{g}_k^H) \big]^H$. Problem (16a) is thus reformulated as

$$\mathcal{P}_F: \min_{\mathbf{f}} \; \max_{k=1,\cdots,K} \Bigg( \sum_{j=1}^K \Big| (\mathbf{a}_{k,j}^{(n)})^H \mathbf{f} - \alpha_j \Big|^2 + \mathbf{f}^H \mathbf{G}_k^{(n)} \mathbf{f} \Bigg) \;\; \mathrm{s.t.}\;\; |f_l| = 1, \; l = 1,\cdots,N^2. \qquad (19)$$

To handle the nonseparable objective function, variable splitting of $\mathbf{f}$ is proposed such that $\mathbf{f} = \mathbf{u}_1 = \cdots = \mathbf{u}_K$, where $\{\mathbf{u}_k\}$ are auxiliary variables. Moreover, to handle the unit-modulus constraints, another auxiliary variable $\mathbf{z} = \mathbf{f}$ is introduced. All the newly introduced equality constraints can be transformed into quadratic penalties in the objective function [29]. As a result, $\mathcal{P}_F$ is approximately transformed into

$$\min_{\mathbf{f}, \mathbf{z}, \{\mathbf{u}_k\}} \; \max_{k=1,\cdots,K} \Bigg( \sum_{j=1}^K \Big| (\mathbf{a}_{k,j}^{(n)})^H \mathbf{u}_k - \alpha_j \Big|^2 + \mathbf{u}_k^H \mathbf{G}_k^{(n)} \mathbf{u}_k \Bigg) + \rho \Bigg( \frac{1}{K} \sum_{j=1}^K \| \mathbf{u}_j - \mathbf{f} \|^2 + \| \mathbf{z} - \mathbf{f} \|^2 \Bigg) \;\; \mathrm{s.t.}\;\; |z_l| = 1, \; l = 1,\cdots,N^2, \qquad (20)$$

where $\rho$ is a tuning parameter. It can be proved that $\mathcal{P}_F$ and (20) are equivalent problems as $\rho \to +\infty$ [29].
However, $\rho \to +\infty$ also makes the gradient norm of the objective function of (20) unbounded, making (20) difficult to solve. Therefore, $\rho$ controls the tradeoff between the approximation error and the difficulty of solving (20).

We address (20) using alternating minimization, in which the cost function is iteratively minimized with respect to one variable while the others are fixed. Starting from an initial $\mathbf{f}^{(0)} = \mathbf{z}^{(0)} = \mathbf{u}_k^{(0)} = \mathrm{vec}(\mathbf{F}^{(n)})$, the whole process consists of iteratively solving

$$\mathbf{u}_k^{(m+1)} = \arg\min_{\mathbf{u}_k} \; \sum_{j=1}^K \Big| (\mathbf{a}_{k,j}^{(n)})^H \mathbf{u}_k - \alpha_j \Big|^2 + \mathbf{u}_k^H \mathbf{G}_k^{(n)} \mathbf{u}_k + \frac{\rho}{K} \| \mathbf{u}_k - \mathbf{f}^{(m)} \|^2, \quad \forall k, \qquad (21a)$$

$$\mathbf{f}^{(m+1)} = \arg\min_{\mathbf{f}} \; \rho \Bigg( \frac{1}{K} \sum_{j=1}^K \| \mathbf{u}_j^{(m+1)} - \mathbf{f} \|^2 + \| \mathbf{z}^{(m)} - \mathbf{f} \|^2 \Bigg), \qquad (21b)$$

$$\mathbf{z}^{(m+1)} = \arg\min_{|z_l| = 1, \forall l} \; \rho\, \| \mathbf{z} - \mathbf{f}^{(m+1)} \|^2, \qquad (21c)$$

where $m$ is the inner iteration index. It can be verified that the objective function of (20) is strongly convex. Therefore, despite the non-differentiability of the objective, the alternating minimization (21a)–(21c) is guaranteed to converge to a stationary point of (20) [28]. The iterative procedure stops when $m$ reaches the maximum iteration number $m = M_{\max}$.

The remaining question is how to solve (21a)–(21c) optimally. Problems (21a) and (21b) are standard least squares problems, so their solutions are given by the closed-form expressions

$$\mathbf{u}_k^{(m+1)} = \Bigg( \sum_{j=1}^K \mathbf{a}_{k,j}^{(n)} (\mathbf{a}_{k,j}^{(n)})^H + \mathbf{G}_k^{(n)} + \frac{\rho}{K}\, \mathbf{I} \Bigg)^{-1} \Bigg( \sum_{j=1}^K \alpha_j\, \mathbf{a}_{k,j}^{(n)} + \frac{\rho}{K}\, \mathbf{f}^{(m)} \Bigg), \qquad (22)$$

$$\mathbf{f}^{(m+1)} = \frac{1}{2} \Bigg( \frac{1}{K} \sum_{j=1}^K \mathbf{u}_j^{(m+1)} + \mathbf{z}^{(m)} \Bigg), \qquad (23)$$

respectively. On the other hand, problem (21c) is the projection of $\mathbf{f}^{(m+1)}$ onto the unit-modulus constraints, and the optimal solution is simply

$$\mathbf{z}^{(m+1)} = \exp\big( j\, \angle \mathbf{f}^{(m+1)} \big). \qquad (24)$$
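A compact sketch of the inner-layer updates (22)–(24), assuming the vectors $\mathbf{a}_{k,j}$ and matrices $\mathbf{G}_k$ have been precomputed. Dense inverses are used for clarity; note that the matrix in (22) does not change with $m$, so its inverse can be cached per outer iteration.

```python
import numpy as np

def pam_inner(A, G, alpha, f0, rho, M_max):
    """A: (K, K, N^2) with A[k, j] = a_{k,j};  G: (K, N^2, N^2).
    Returns f after M_max rounds of (22)-(24)."""
    K, _, N2 = A.shape
    f, z = f0.copy(), np.exp(1j * np.angle(f0))
    # The matrix and the constant part of the RHS of (22) are m-independent.
    lhs_inv = [np.linalg.inv(sum(np.outer(A[k, j], A[k, j].conj())
                                 for j in range(K))
                             + G[k] + rho / K * np.eye(N2)) for k in range(K)]
    rhs_c = [sum(alpha[j] * A[k, j] for j in range(K)) for k in range(K)]
    for _ in range(M_max):
        U = [lhs_inv[k] @ (rhs_c[k] + rho / K * f) for k in range(K)]  # (22)
        f = 0.5 * (np.mean(U, axis=0) + z)                             # (23)
        z = np.exp(1j * np.angle(f))                                   # (24)
    return f
```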
2) Optimization of $\{r_k\}$: The problem of optimizing $\{r_k\}$ in (16b) is also a least squares problem. The optimal solution is found by setting the derivative $\partial\, \mathrm{MSE}_k / \partial\, \mathrm{conj}(r_k)$ to zero:

$$\frac{\partial\, \mathrm{MSE}_k}{\partial\, \mathrm{conj}(r_k)} = \gamma \sum_{j=1}^K \Big( r_k \mathbf{g}_k^H \mathbf{F}^{(n+1)} \mathbf{h}_j t_j^{(n)} - \alpha_j \Big)\, \mathrm{conj}\Big( \mathbf{g}_k^H \mathbf{F}^{(n+1)} \mathbf{h}_j t_j^{(n)} \Big) + \gamma \sigma_b^2 \big\| \mathbf{g}_k^H \mathbf{F}^{(n+1)} \big\|^2 r_k + \sigma_k^2 r_k = 0, \qquad (25)$$

which yields

$$r_k^{(n+1)} = \frac{ \sum_{j=1}^K \alpha_j\, \mathrm{conj}\big( \mathbf{g}_k^H \mathbf{F}^{(n+1)} \mathbf{h}_j t_j^{(n)} \big) }{ \sum_{j=1}^K \big| \mathbf{g}_k^H \mathbf{F}^{(n+1)} \mathbf{h}_j t_j^{(n)} \big|^2 + \sigma_b^2 \big\| \mathbf{g}_k^H \mathbf{F}^{(n+1)} \big\|^2 + \sigma_k^2 / \gamma }. \qquad (26)$$
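In code, (26) is a one-line ratio per user (names follow the sketches above):

```python
import numpy as np

def update_r(F, t, G, H, alpha, gamma, sig_b2, sig_k2):
    """Closed-form receive coefficients (26); returns a length-K vector."""
    K = len(t)
    r = np.empty(K, dtype=complex)
    for k in range(K):
        gF = G[:, k].conj() @ F                     # g_k^H F
        eff = gF @ H * t                            # effective gains, length K
        r[k] = (alpha * eff.conj()).sum() / (
            (np.abs(eff) ** 2).sum()
            + sig_b2 * np.linalg.norm(gF) ** 2
            + sig_k2[k] / gamma)
    return r
```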
3) Optimization of $\{t_k\}$: The objective function of problem (16c) is not separable. Following a variable splitting procedure similar to (20), we introduce auxiliary variables $\{\xi_{k,j} = t_j, \; \forall k, j\}$ and add the quadratic penalty $\rho \sum_{k=1}^K \sum_{j=1}^K |\xi_{k,j} - t_j|^2$ to the objective function. Problem (16c) is approximately transformed into

$$\mathcal{P}_t: \min_{\{t_k, \xi_{k,j}\}} \; \rho \sum_{k=1}^K \sum_{j=1}^K |\xi_{k,j} - t_j|^2 + \max_{k=1,\cdots,K} \sum_{j=1}^K \Big| r_k^{(n+1)} \mathbf{g}_k^H \mathbf{F}^{(n+1)} \mathbf{h}_j \xi_{k,j} - \alpha_j \Big|^2 \;\; \mathrm{s.t.}\;\; |t_k|^2 \le P, \; k = 1,\cdots,K, \qquad (27)$$

and the variables $t_k$ and $\xi_{k,j}$ can be optimized iteratively. In particular, starting from $t_k^{(0)} = \xi_{k,j}^{(0)} = t_k^{(n)}$, the solution of (27) is obtained iteratively, where the $q$-th iteration is given by

$$\xi_{k,j}^{(q+1)} = \arg\min_{\xi_{k,j}} \; \Big| r_k^{(n+1)} \mathbf{g}_k^H \mathbf{F}^{(n+1)} \mathbf{h}_j \xi_{k,j} - \alpha_j \Big|^2 + \rho\, |\xi_{k,j} - t_j^{(q)}|^2, \quad \forall k, j, \qquad (28)$$

$$t_j^{(q+1)} = \arg\min_{|t_j|^2 \le P} \; \rho \sum_{k=1}^K |\xi_{k,j}^{(q+1)} - t_j|^2, \quad \forall j. \qquad (29)$$

Problem (28) is a least squares problem and (29) is a quadratic problem with only one constraint. They can be solved optimally based on the Karush-Kuhn-Tucker (KKT) conditions, and the solutions are given by

$$\xi_{k,j}^{(q+1)} = \frac{ \mathrm{conj}\big( r_k^{(n+1)} \mathbf{g}_k^H \mathbf{F}^{(n+1)} \mathbf{h}_j \big)\, \alpha_j + \rho\, t_j^{(q)} }{ \big| r_k^{(n+1)} \mathbf{g}_k^H \mathbf{F}^{(n+1)} \mathbf{h}_j \big|^2 + \rho }, \quad \forall k, j, \qquad (30)$$

$$t_j^{(q+1)} = \min\Big( \sqrt{P},\; |\bar{\xi}_j| \Big)\, \frac{\bar{\xi}_j}{|\bar{\xi}_j|}, \quad \text{where } \bar{\xi}_j = \frac{1}{K} \sum_{k=1}^K \xi_{k,j}^{(q+1)}, \quad \forall j. \qquad (31)$$
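A sketch of the $q$-loop (28)–(31), assuming the effective complex gains $c_{k,j} = r_k \mathbf{g}_k^H \mathbf{F} \mathbf{h}_j$ are precomputed:

```python
import numpy as np

def update_t(C_eff, alpha, t0, P, rho, Q_max):
    """C_eff: (K, K) matrix with C_eff[k, j] = r_k g_k^H F h_j."""
    t = t0.copy()
    for _ in range(Q_max):
        xi = (C_eff.conj() * alpha[None, :] + rho * t[None, :]) \
             / (np.abs(C_eff) ** 2 + rho)              # (30), all (k, j) at once
        xi_bar = xi.mean(axis=0)                       # average over k
        t = np.minimum(np.sqrt(P), np.abs(xi_bar)) \
            * xi_bar / np.abs(xi_bar)                  # (31): project onto disk
    return t
```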
In summary, the complete PAM algorithm for solving problem $\mathcal{P}$ with the fully-connected structure consists of two layers of iterations. Let $N_{\max}$ and $M_{\max}$ denote the maximum numbers of iterations for the outer and inner layers, respectively. In the outer layer, PAM optimizes $\mathbf{F}$, $\{r_k\}$, and $\{t_k\}$ alternately in each of the $N_{\max}$ iterations. In the inner layer, $\mathbf{F}$ is obtained by computing (22)–(24) for $M_{\max}$ iterations, $\{r_k\}$ is obtained by computing (26), and $\{t_k\}$ is obtained by computing (30)–(31) for $M_{\max}$ iterations. The computation is dominated by (22)–(24), which operate on vectors of dimension $N^2$; with the matrix inverse in (22) computed once per outer iteration, each inner iteration costs $O(KN^4)$ for the matrix-vector products. Hence, the total computational complexity of PAM is $O(N_{\max} M_{\max} K N^4)$.
V. ACCELERATED GRADIENT PROJECTION FOR PARTIALLY-CONNECTED UM-AIRCOMP
In practice, a large number of antennas may be deployed at the edge server. In such a case, using fewer than $N^2$ phase shifters at the edge server is desirable. To this end, this section proposes an accelerated gradient projection method for partially-connected UM-AirComp, which needs only $2N$ phase shifters.

With the partially-connected structure illustrated in Fig. 1c, the feasible set $\mathcal{F}$ equals the second line of (14). Since the rank of $\mathbf{F}$ is $1$, we can apply the rank-one decomposition $\mathbf{F} = \mathbf{v}\mathbf{w}^H$. Then, we adopt the following approximations to $\mathcal{P}$: 1) set $r_k = 1/(\mathbf{g}_k^H \mathbf{v})$ and $t_k = \alpha_k/(\mathbf{w}^H \mathbf{h}_k)$, so that the misalignment term in (15a) vanishes; and 2) replace $\{|F_{l,m}| = 1\,|\,\forall l, m\}$ by $\|\mathbf{F}\|^2 \le N^2$. After the above steps, problem $\mathcal{P}$ is simplified into a bilevel form:

$$\mathcal{Q}: \min_{\mathbf{v}} \; \max_{k=1,\cdots,K} \frac{\sigma_k^2}{|\mathbf{g}_k^H \mathbf{v}|^2} \;\; \mathrm{s.t.}\;\; \frac{N^2}{\|\mathbf{v}\|^2} \ge \min_{\mathbf{w}} \bigg\{ \|\mathbf{w}\|^2 : |\mathbf{w}^H \mathbf{h}_k|^2 \ge \frac{\alpha_k^2}{P}, \; \forall k \bigg\}. \qquad (32)$$

Since the right-hand side of the constraint in (32) is a quadratic optimization problem, it can be solved by the accelerated random coordinate descent method with a complexity of $O(KN)$ [33]. Denoting the resulting solution as $\mathbf{w} = \mathbf{w}^{\diamond}$, problem $\mathcal{Q}$ reduces to

$$\mathcal{Q}_1: \min_{\mathbf{v}} \; \max_{k=1,\cdots,K} \frac{\sigma_k^2}{|\mathbf{g}_k^H \mathbf{v}|^2} \;\; \mathrm{s.t.}\;\; \|\mathbf{v}\|^2 \le \beta, \qquad (33)$$

where $\beta = N^2/\|\mathbf{w}^{\diamond}\|^2$. In the following, we propose an efficient fixed-point method for solving problem $\mathcal{Q}_1$.
A. Fixed-Point Iteration

The major challenges in solving problem $\mathcal{Q}_1$ are the large dimension of the variable and the large number of terms inside the maximum operator. To this end, we first rewrite $\mathcal{Q}_1$ as a bilevel problem:

$$\min_{\|\mathbf{v}\|^2 \le \beta} \max_{k=1,\cdots,K} \frac{\sigma_k^2}{|\mathbf{g}_k^H \mathbf{v}|^2} \;\Longleftrightarrow\; \min_{\|\mathbf{v}\|^2 \le \beta} \max_{k=1,\cdots,K} \left(-\frac{|\mathbf{g}_k^H \mathbf{v}|^2}{\sigma_k^2}\right) \;\Longleftrightarrow\; \min_{\|\mathbf{v}\|^2 \le \beta} \max_{\mathbf{b} \in \Delta} \underbrace{-\sum_{k=1}^K b_k \frac{|\mathbf{g}_k^H \mathbf{v}|^2}{\sigma_k^2}}_{:= h(\mathbf{v}, \mathbf{b})}, \qquad (34)$$

where $\Delta = \{\mathbf{b}\,|\,\mathbf{b} \succeq \mathbf{0},\, \mathbf{1}^T \mathbf{b} = 1\}$ and the last step smooths the objective function by introducing an auxiliary optimization variable $\mathbf{b}$ [36]. Then, we have the following conclusion on the Karush-Kuhn-Tucker solutions of (34), which also holds for $\mathcal{Q}_1$.
Lemma 1. Let

$$U(\mathbf{v}') = \sqrt{\beta}\, \frac{ \mathbf{C}(\mathbf{v}')\, \arg\min_{\mathbf{b}\in\Delta} \Phi(\mathbf{v}', \mathbf{b}) }{ \big\| \mathbf{C}(\mathbf{v}')\, \arg\min_{\mathbf{b}\in\Delta} \Phi(\mathbf{v}', \mathbf{b}) \big\| }, \qquad (35)$$

where

$$\Phi(\mathbf{v}', \mathbf{b}) = 2\sqrt{\beta}\, \big\| \mathbf{C}(\mathbf{v}')\, \mathbf{b} \big\| - [\mathbf{q}(\mathbf{v}')]^T \mathbf{b}, \qquad (36)$$

$$\mathbf{C}(\mathbf{v}') = \bigg[ \frac{\mathbf{g}_1 \mathbf{g}_1^H \mathbf{v}'}{\sigma_1^2}, \cdots, \frac{\mathbf{g}_K \mathbf{g}_K^H \mathbf{v}'}{\sigma_K^2} \bigg] \in \mathbb{C}^{N\times K}, \qquad (37)$$

$$\mathbf{q}(\mathbf{v}') = \bigg[ \frac{|\mathbf{g}_1^H \mathbf{v}'|^2}{\sigma_1^2}, \cdots, \frac{|\mathbf{g}_K^H \mathbf{v}'|^2}{\sigma_K^2} \bigg]^T \in \mathbb{R}^{K\times 1}. \qquad (38)$$

Then, with any feasible $\mathbf{v}^{(0)}$ and the fixed-point iteration $\mathbf{v}^{(n+1)} \leftarrow U(\mathbf{v}^{(n)})$, every limit point $\mathbf{v}^{\diamond}$ of the sequence $\{\mathbf{v}^{(0)}, \mathbf{v}^{(1)}, \ldots\}$ is a Karush-Kuhn-Tucker solution to problem (34).

Proof. See Appendix B.

Although Lemma 1 reveals the solution structure of (34), the computation of $U(\mathbf{v}')$ is not straightforward, as it involves another optimization problem over $\mathbf{b}$, which should also be solved with low computational cost. In the following, we solve the optimization problem over $\mathbf{b}$ via smoothing and acceleration.

B. Optimization of b via Smoothing and Acceleration

In order to compute $U(\mathbf{v}')$, a necessary step is to find the optimal vector

$$\mathbf{b}^* = \arg\min_{\mathbf{b}\in\Delta} \Phi(\mathbf{b}), \qquad (39)$$

where we have omitted the symbol $\mathbf{v}'$ in (36), since $\mathbf{v}'$ is a known and fixed vector in each iteration. Notice that the gradient of the objective,

$$\nabla_{\mathbf{b}} \Phi(\mathbf{b}) = \frac{2\sqrt{\beta}\, \mathrm{Re}\big( \mathbf{C}^H \mathbf{C} \mathbf{b} \big)}{\| \mathbf{C}\mathbf{b} \|} - \mathbf{q}, \qquad (40)$$

is unbounded when $\|\mathbf{C}\mathbf{b}\| \to 0$, which happens if $\mathbf{b} \in \mathrm{Null}(\mathbf{C})$. Therefore, it is nontrivial to apply a first-order method to problem (39). To avoid the unbounded gradient, we adopt the smoothing technique [37] and replace $\Phi(\mathbf{b})$ in (39) with

$$\Xi(\mathbf{b}) = 2\sqrt{\beta}\, \sqrt{ \phi + \| \mathbf{C}\mathbf{b} \|^2 } - \mathbf{q}^T \mathbf{b}, \qquad (41)$$

where the tuning parameter $\phi \ge 0$ is such that $\Xi(\mathbf{b}) = \Phi(\mathbf{b})$ for $\phi = 0$. Then problem (39) can be approximated by

$$\mathcal{Q}_2: \min_{\mathbf{b}\in\Delta} \Xi(\mathbf{b}). \qquad (42)$$

In the following, we first elaborate the optimal solution of problem $\mathcal{Q}_2$, and then establish the relation between the solutions of problems $\mathcal{Q}_2$ and (39). First of all, we have the following lemma on the objective of problem $\mathcal{Q}_2$.

Lemma 2. $\Xi(\mathbf{b})$ is Lipschitz smooth for $\mathbf{b}\in\Delta$, with the Lipschitz constant of the gradient

$$L_{\Xi}(\phi) = \frac{ 2\sqrt{\beta}\, \lambda_{\max}\big[ \mathrm{Re}\big( \mathbf{C}^H \mathbf{C} \big) \big] }{ \sqrt{ \phi + \lambda_{\min}(\mathbf{C}^H \mathbf{C})/K } }. \qquad (43)$$

Proof. See Appendix C.
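To make (35)–(41) concrete, the following sketch builds $\mathbf{C}$, $\mathbf{q}$, the smoothed objective $\Xi$, and its gradient for a given $\mathbf{v}'$ (random data as placeholders):

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, beta, phi = 8, 4, 5.0, 1e-3
Gch = rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))  # g_k
sig2 = np.ones(K)                                   # noise powers sigma_k^2
v = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# C(v') in (37): k-th column is g_k g_k^H v' / sigma_k^2
C = Gch * (Gch.conj().T @ v)[None, :] / sig2[None, :]
q = np.abs(Gch.conj().T @ v) ** 2 / sig2            # q(v') in (38)

def Xi(b):                                          # smoothed objective (41)
    return 2 * np.sqrt(beta) * np.sqrt(phi + np.linalg.norm(C @ b) ** 2) \
           - q @ b

def grad_Xi(b):                                     # its gradient, cf. (45)
    return 2 * np.sqrt(beta) * (C.conj().T @ (C @ b)).real \
           / np.sqrt(phi + np.linalg.norm(C @ b) ** 2) - q
```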
Lemma 2 shows that $\mathcal{Q}_2$ is a Lipschitz smooth problem. As a result, the acceleration method [38], [39] can be adopted to optimally solve $\mathcal{Q}_2$ iteratively. The algorithm is summarized in Theorem 2.
Theorem 2. Let $\mathbf{b}^{(0)} \in \Delta$ and

$$\mathbf{b}^{(m+1)} = \Pi_{\Delta}\Big[ \boldsymbol{\rho}^{(m)} - \frac{1}{L_{\Xi}(\phi)}\, \nabla_{\mathbf{b}} \Xi(\mathbf{b}) \Big|_{\mathbf{b} = \boldsymbol{\rho}^{(m)}} \Big], \qquad (44)$$

where $m$ is the iteration index, $\Pi_{\Delta}$ is the projection onto the set $\Delta$, $L_{\Xi}(\phi)$ is defined in (43), and

$$\nabla_{\mathbf{b}} \Xi(\mathbf{b}) = \frac{ 2\sqrt{\beta}\, \mathrm{Re}\big( \mathbf{C}^H \mathbf{C} \mathbf{b} \big) }{ \sqrt{ \phi + \| \mathbf{C}\mathbf{b} \|^2 } } - \mathbf{q}, \qquad (45)$$

$$\boldsymbol{\rho}^{(m)} = \mathbf{b}^{(m)} + \frac{ c^{(m-1)} - 1 }{ c^{(m)} } \big( \mathbf{b}^{(m)} - \mathbf{b}^{(m-1)} \big), \qquad (46)$$

$$c^{(m)} = \frac{1}{2} \Big( 1 + \sqrt{ 1 + 4\, (c^{(m-1)})^2 } \Big), \quad c^{(0)} = 1. \qquad (47)$$

Then the sequence computed from (44)–(47) converges to the optimal solution of $\mathcal{Q}_2$ with an iteration complexity of $O\big( \sqrt{L_{\Xi}(\phi)/\epsilon} \big)$, where $\epsilon$ is the target accuracy.

Proof. It can be proved by following a similar approach to [39, Theorem 4.4].

Notice that this iteration complexity attains the lower bound derived in [40, Theorem 2.1.6]. The computation of the projection $\Pi_{\Delta}(\mathbf{u})$ for a given input vector $\mathbf{u}$ is summarized in Lemma 3.
Lemma 3. Let $\mathbf{u}' = \mathrm{sort}(\mathbf{u})$, where the function $\mathrm{sort}$ permutes the elements of $\mathbf{u}$ in descending order such that $u_1' \ge \cdots \ge u_K'$, and let $\delta = \max_{x\in\{1,\cdots,K\}} \big\{ x : \sum_{l=1}^x u_l' - 1 < x\, u_x' \big\}$. Then

$$\Pi_{\Delta}(\mathbf{u}) = \Bigg[ \mathbf{u} - \frac{ \sum_{l=1}^{\delta} u_l' - 1 }{ \delta }\, \mathbf{1} \Bigg]_+. \qquad (48)$$
Proof. Please refer to [41, Proposition 2.2]; the details are omitted for brevity.
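A sketch of the accelerated projection iteration (44)–(47) together with the simplex projection (48), reusing grad_Xi from the earlier sketch; L_Xi is assumed to be given by (43):

```python
import numpy as np

def proj_simplex(u):
    """Projection onto the probability simplex, per (48)."""
    us = np.sort(u)[::-1]                       # descending order
    css = np.cumsum(us) - 1.0
    delta = np.nonzero(css < np.arange(1, len(u) + 1) * us)[0][-1] + 1
    return np.maximum(u - css[delta - 1] / delta, 0.0)

def agp_b(grad_Xi, L_Xi, K, M_max):
    """Accelerated projected gradient (44)-(47) over b in the simplex."""
    b_prev = b = np.ones(K) / K                 # feasible b^{(0)}
    c_prev = 1.0
    for _ in range(M_max):
        c = 0.5 * (1 + np.sqrt(1 + 4 * c_prev ** 2))       # (47)
        p = b + (c_prev - 1.0) / c * (b - b_prev)          # (46)
        b_prev, b = b, proj_simplex(p - grad_Xi(p) / L_Xi) # (44)
        c_prev = c
    return b
```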
Finally, we have the following conclusion on the relation between the solutions to problems $\mathcal{Q}_2$ and (39).

Theorem 3. (i) If $\mathrm{Rank}([\mathbf{g}_1, \cdots, \mathbf{g}_K]) = K$ (thus $L_{\Xi}(0) < +\infty$), the optimal solution to problem $\mathcal{Q}_2$ is optimal to problem (39) by setting $\phi = 0$. (ii) If $\mathrm{Rank}([\mathbf{g}_1, \cdots, \mathbf{g}_K]) < K$, then $L_{\Xi}(0) = +\infty$, and $L_{\Xi}(\phi) < +\infty$ if $\phi > 0$. (iii) For all $\mathbf{b}' \in \Delta$ with $\Xi(\mathbf{b}') - \Xi(\mathbf{b}^{\diamond}) \le \epsilon$,

$$\Phi(\mathbf{b}') - \Phi(\mathbf{b}^*) \le 2\sqrt{\beta \phi} + \epsilon, \qquad (49)$$

where $\mathbf{b}^{\diamond}$ and $\mathbf{b}^*$ denote the optimal solutions to $\mathcal{Q}_2$ and (39), respectively.

Proof. See Appendix D.

Part (i) of Theorem 3 indicates that we can always set $\phi = 0$ if the user channels are linearly independent. In this case, $\Phi(\mathbf{b}) = \Xi(\mathbf{b})$, which means that the optimal solution to problem $\mathcal{Q}_2$ is the same as that of (39). On the other hand, part (ii) of Theorem 3 indicates that if the user channels are correlated, we must choose $\phi > 0$, and the conversion from (39) to $\mathcal{Q}_2$ leads to an approximation error. However, this error is controllable by choosing a small $\sqrt{\beta\phi}$ according to part (iii) of Theorem 3 (e.g., choosing $\phi = \epsilon^2/(4\beta)$ makes the approximation error $2\sqrt{\beta\phi} = \epsilon$).

C. Summary and Complexity Analysis of AGP
For the proposed AGP algorithm, the accelerated random coordinate descent is first used to compute $\mathbf{w}^{\diamond}$ for the problem on the right-hand side of the constraint in (32), which requires a complexity of $O(KN)$. To optimize $\mathbf{v}$, in each fixed-point iteration the terms $\mathbf{C}$ in (37) and $\mathbf{q}$ in (38) are computed with a complexity of $O(KN)$, followed by the iterative calculation of the variable $\mathbf{b}$ in (39) via equations (44)–(47), which involves a complexity of $O(KN)$ for the gradient computation. Therefore, the overall complexity of AGP for solving $\mathcal{Q}_1$ is $O(KN)$.

Notice that with the obtained $\mathbf{w}^{\diamond}$ and $\mathbf{v}^{\diamond}$, we need to recover $\{\mathbf{F}^{\star}, r_k^{\star}, t_k^{\star}\}$. To satisfy the unit-modulus constraints, $\mathbf{w}^{\diamond}$ and $\mathbf{v}^{\diamond}$ are refined into $\mathbf{w}^{\star} = \exp(j\angle\mathbf{w}^{\diamond})$ and $\mathbf{v}^{\star} = \exp(j\angle\mathbf{v}^{\diamond})$. The final solution is given by $t_k^{\star} = 1/(K(\mathbf{w}^{\star})^H \mathbf{h}_k)$ and $\mathbf{F}^{\star} = \mathbf{v}^{\star}(\mathbf{w}^{\star})^H$. With $t_k^{\star}$ and $\mathbf{F}^{\star}$, $r_k^{\star}$ is computed using (26).
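The recovery step is a few lines (w_d, v_d denote $\mathbf{w}^{\diamond}$, $\mathbf{v}^{\diamond}$ from the two subproblems; update_r is the sketch of (26) above):

```python
import numpy as np

def recover_solution(w_d, v_d, H, G, alpha, gamma, sig_b2, sig_k2):
    """Map the relaxed AGP solution back to a feasible UM-AirComp design."""
    w_s = np.exp(1j * np.angle(w_d))            # unit-modulus refinement
    v_s = np.exp(1j * np.angle(v_d))
    F = np.outer(v_s, w_s.conj())               # rank-one F* = v* w*^H
    K = H.shape[1]
    t = np.array([1.0 / (K * (w_s.conj() @ H[:, k])) for k in range(K)])
    r = update_r(F, t, G, H, alpha, gamma, sig_b2, sig_k2)   # (26)
    return F, t, r
```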
Remark 3 (Number of Phase Shifters): For the UM-AirComp scheme with AGP, the analog beamforming matrix $\mathbf{F}^{\star} = \mathbf{v}^{\star}(\mathbf{w}^{\star})^H$ is rank-one. Therefore, it only costs $2N$ phase shifters at the edge server, which is significantly smaller than the $N^2$ phase shifters used in the PAM-based UM-AirComp scheme. Hence, it is particularly suitable for FL systems with massive antenna arrays at the edge. In practice, the number of phase shifters can also take intermediate values (e.g., $4N, 6N, \cdots$) by varying the number of RF chains.

VI. SIMULATION RESULTS AND DISCUSSIONS
This section presents simulation results to verify the performance of the proposed scheme. The pathloss of user $k$ is set to $\varrho_k = -60\,\mathrm{dB}$, and $\mathbf{h}_k$ and $\mathbf{g}_k$ are generated according to $\mathcal{CN}(\mathbf{0}, \varrho_k \mathbf{I}_N)$. The power scaling factor is $\gamma = 1$ and the maximum transmit power at the users is $P = 10\,\mathrm{mW}$ (i.e., $10\,\mathrm{dBm}$). The number of local updates is $E = 1$. The noise powers at the server and the users are set to $-90\,\mathrm{dBm}$, which captures the effects of thermal noise, receiver noise, and interference. Each point of the MSE is obtained by averaging over independent runs, with independent channel and noise realizations in each run. All problems are solved by Matlab R2019a on a desktop with an Intel Core i7-7700 CPU at 3.6 GHz and 16 GB RAM. The interior point method is implemented using CVX Mosek [32], a Matlab software package for convex optimization.
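For reference, the channel and noise statistics above translate into the following generator (linear-scale conversions shown explicitly; a sketch, not the authors' code):

```python
import numpy as np

def make_channels(N, K, pathloss_db=-60.0, rng=np.random.default_rng(5)):
    """Draw h_k, g_k ~ CN(0, rho_k I_N) with rho_k = -60 dB."""
    rho = 10.0 ** (pathloss_db / 10.0)
    shape = (N, K)
    draw = lambda: np.sqrt(rho / 2) * (rng.standard_normal(shape)
                                       + 1j * rng.standard_normal(shape))
    return draw(), draw()                       # uplink H, downlink G

P = 10e-3                                       # 10 mW, i.e., 10 dBm
noise_power = 10.0 ** (-90.0 / 10.0) * 1e-3     # -90 dBm in watts
H, G = make_channels(N=8, K=10)
```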
A. Performance Evaluation of PAM-based and AGP-based UM-AirComp
First, the convergence behavior of the proposed PAM is demonstrated. According to Section IV, PAM consists of two iteration layers. To verify the convergence of the inner layer (i.e., equations (22)–(24)) of the proposed PAM, Fig. 3a shows the objective function of $\mathcal{P}_F$ (i.e., (19)) versus the number of inner iterations $m$ at $n = 0$ when $N = 8$ and $K = 10$. It can be seen that the inner layer for updating $\mathbf{F}$ converges within a moderate number of iterations under the adopted value of $\rho$. Since the convergence behaviors of equations (28)–(29) are similar, they are not repeated here. On the other hand, to verify the convergence of the outer layer (i.e., equations (16a)–(16c)) of the proposed PAM, Fig. 3b shows the objective function of $\mathcal{P}$ (i.e., (15a)) versus the number of outer iterations $n$. As observed from the figure, the proposed algorithm converges and the MSE stabilizes after a small number of iterations, which indicates that the number of outer iterations $N_{\max}$ needed for PAM to converge is small. To reduce the runtime in practice, the inner and outer problems can both be solved approximately, and we set $M_{\max} = 200$ and $N_{\max} = 20$ in the subsequent simulations.

Fig. 3: a) The objective function of $\mathcal{P}_F$ versus the number of inner iterations $m$ at $n = 0$ when $N = 8$; b) the objective function of $\mathcal{P}$ versus the number of outer iterations $n$ when $N = 8$.

Next, to verify the learning performance of PAM-based UM-AirComp, we consider three benchmark schemes:

1) Baseline scheme, which sets $\mathbf{F} = \mathbf{I}_N$ and $\{t_k = \sqrt{P}\,|\,\forall k\}$. The receiver coefficients $\{r_k\}$ are computed using (26).

2) Fixed beamforming scheme, which sets $\mathbf{F} = \mathbf{I}_N$. The transmitter coefficients $\{t_k\}$ and the receiver coefficients $\{r_k\}$ are optimized iteratively by solving (16c) using CVX and (16b) using (26), respectively.

3) UM-AirComp with SDR and CVX. The analog beamformer $\mathbf{F}$, the transmitter coefficients $\{t_k\}$, and the receiver coefficients $\{r_k\}$ are optimized iteratively by solving (16a) using SDR, (16c) using CVX, and (16b) using (26), respectively. (If the matrix solution to the SDR problem of (16a) is not rank-one, the principal component of the obtained matrix is used as the phase shift design.)

We simulate the deep learning task of Example 2. It is assumed that the channel coherence time lasts for several FL iterations. The result is shown in Fig. 4. Specifically, the radar map in Fig. 4a compares the normalized MSE (i.e., the objective function of $\mathcal{P}$), the training loss in (1), the testing error, and the computation time. Since our goal is to minimize all these metrics concurrently, a smaller area indicates better performance. It can be seen from Fig. 4a that the proposed PAM-based UM-AirComp scheme achieves the smallest area among all the simulated schemes. In particular, the area of the proposed scheme is completely covered by that of the fixed beamforming scheme, and the area reduction comes from the analog beamforming design and the low-complexity nature of PAM. The scheme with SDR and CVX is the most time consuming, making it inapplicable in practice. On the other hand, although the baseline scheme is the fastest, it has the highest MSE, training loss, and test error.

Fig. 4b and Fig. 4c show the training loss and the testing error versus the number of FL iterations, respectively. Due to its high MSE, FL with the baseline scheme diverges.
FL with fixed beamforming is competitive with the proposed scheme at the beginning, but its training loss does not reduce further as the number of FL iterations increases. Moreover, its test error increases in later iterations. This means that model parameter errors have a stronger impact as the number of FL iterations increases. This is because deep learning models are usually over-parameterized; as the training procedure gets closer to convergence, the model parameters become sparser and are thus more sensitive to model parameter errors. Based on the above observations, one can adopt an approximate solution (e.g., executing a few iterations of the PAM algorithm) for the edge FL design at the early stage and switch to a high-performance solution (e.g., executing the PAM algorithm until convergence) at the later stage. On the other hand, the scheme with SDR and CVX has the same learning performance as the proposed scheme. However, as indicated in Fig. 4a, it requires a much higher runtime.

To evaluate the solution quality and runtime of the proposed AGP-based UM-AirComp, we simulate a range of antenna numbers $N$ with $K = 4$. It can be seen from Fig. 5a that the scheme with SDR and CVX is the most time consuming, and it fails to provide a solution within a reasonable amount of time for the largest simulated antenna numbers.
Training Loss
Run Time
Testing Error
Baseline Fixed beamforming SDR Proposed PAM (a)
Number of FL Iterations T r a i n i ng Lo ss BaselineFixed beamformingSDR+CVXProposed PAM-based UM-AirComp (b)
Number of FL Iterations T e s t E rr o r BaselineFixed beamformingSDR+CVXProposed PAM-based UM-AirComp (c)
Fig. 4: Comparison between the proposed and benchmark schemes when $N = 8$ and $K = 10$: a) comparison of normalized MSE, training loss, test error, and runtime; b) training loss versus the number of FL iterations; c) worst test error among all users versus the number of FL iterations.
Fig. 5: Runtime and normalized MSE versus the number of antennas when $K = 4$.

Note that the proposed PAM-based UM-AirComp, although faster than the SDR method, still requires a high runtime for large $N$. On the other hand, the proposed AGP-based UM-AirComp and the baseline scheme require runtimes two orders of magnitude smaller than those of the other schemes. However, as shown in Fig. 5b, the proposed AGP-based UM-AirComp significantly outperforms the baseline in terms of MSE.

B. UM-AirComp for V2X Autonomous Driving
Finally, to verify the robustness of the proposed UM-AirComp framework in complex learning tasks, we consider V2X-aided FL for 3D object detection. We employ the CARLA simulation platform [24] to generate training/testing scenarios and multi-agent point cloud datasets.

Fig. 6: Detection results when $N = 500$ and $K = 4$. The red box is the ground truth; the blue box is from the proposed AGP-based UM-AirComp scheme; the green box is from the fixed beamforming (benchmark) scheme. a) The benchmark scheme detects nothing while the proposed scheme detects two objects; b) the benchmark scheme only detects two nearby objects while the proposed scheme can also detect far-away objects; c) the benchmark scheme generates false positive results while the proposed scheme generates accurate predictions; d) the benchmark scheme cannot detect occluded objects while the proposed scheme detects all of them.
Fig. 7: a) The global bird's-eye view of the frame in Fig. 6. The green vehicle is in Fig. 6a, the pink vehicle in Fig. 6b, the blue vehicle in Fig. 6c, and the cyan vehicle in Fig. 6d. b) Comparison between the proposed AGP-based UM-AirComp and the fixed beamforming schemes when N = 500 and K = 4.

Each intelligent vehicle (Tesla Model 3) is equipped with a multi-line LiDAR. The default LiDAR range is set to 100 m with a front-facing FoV. We use the "Town02" map [24] populated with objects and autonomous vehicles. Each autonomous vehicle records a sequence of frames, of which the first part is reserved for FL training; with a sampling rate of 1/2, half of these frames are used for training at each vehicle, and the remaining frames are used for inference and testing. Each frame is treated as an input data $d_l^{\mathrm{in}}$, which consists of thousands of points, each described by its coordinates. The corresponding object labels are $d_l^{\mathrm{out}} = \{[c_m, x_m, y_m, z_m, l_m, w_m, h_m, \vartheta_m]^T\}_{m=1}^{M}$, where $M$ is the number of objects, $c_m$ is the category, $(x_m, y_m, z_m)$ are the center coordinates, $(l_m, w_m, h_m)$ stand for the length, width, and height, and $\vartheta_m$ denotes the yaw rotation around the z-axis of the $m$-th object. Thus each sample is $d_l = (d_l^{\mathrm{in}}, d_l^{\mathrm{out}}) \in \{\mathcal{D}_1, \cdots, \mathcal{D}_K\}$. The average precision at a fixed IoU threshold is used for performance evaluation.

The sparsely embedded convolutional detection (SECOND) neural network [21] is used for object detection on the CARLA dataset; the local model structure can be found in [21]. The loss function is $\Theta(d_l, x_k) = f_{\mathrm{class}}(x_k, d_l^{\mathrm{in}}, \{c_m\}) + f_{\mathrm{box}}(x_k, d_l^{\mathrm{in}}, \{x_m, y_m, z_m, l_m, w_m, h_m\}) + f_{\mathrm{soft}}(x_k, d_l^{\mathrm{in}}, \{\vartheta_m\})$, where $f_{\mathrm{class}}$ is the classification loss, $f_{\mathrm{box}}$ is the box regression loss, and $f_{\mathrm{soft}}$ is the softmax orientation estimation loss. Since the multi-agent data generated by CARLA is not compatible with the SECOND network, we develop a data transformation module such that the generated dataset satisfies the KITTI standard [25]. The SECOND network is trained with a diminishing learning rate and E = 1 local update per FL iteration. The experiment is conducted using Python 3.6 on Ubuntu 18.04 with a GeForce RTX 3090 GPU.
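To make the data layout and the composite loss concrete, the following minimal Python sketch encodes one frame sample $d_l = (d_l^{\mathrm{in}}, d_l^{\mathrm{out}})$ and assembles a toy version of $\Theta = f_{\mathrm{class}} + f_{\mathrm{box}} + f_{\mathrm{soft}}$; the point count, the three categories, and the stand-in loss terms are illustrative assumptions rather than the actual SECOND implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# d_in: one LiDAR frame, one row per point (x, y, z); the size is illustrative.
d_in = rng.standard_normal((20000, 3)).astype(np.float32)

# d_out: M object labels [c_m, x_m, y_m, z_m, l_m, w_m, h_m, yaw_m].
d_out = np.array([
    [1.0, 12.3, -4.2, 0.9, 4.5, 1.9, 1.6, 0.31],
    [1.0, 33.0,  2.7, 1.0, 4.2, 1.8, 1.5, -1.20],
], dtype=np.float32)

def detection_loss(pred_cls, pred_box, pred_yaw, labels):
    """Toy composite loss Theta = f_class + f_box + f_soft."""
    c = labels[:, 0].astype(int)          # categories c_m
    box = labels[:, 1:7]                  # (x, y, z, l, w, h)
    yaw = labels[:, 7]                    # yaw rotation around the z-axis
    # f_class: negative log-likelihood of the true category.
    f_class = -np.log(pred_cls[np.arange(len(c)), c] + 1e-9).mean()
    # f_box: smooth-L1 (Huber) regression on the box parameters.
    e = np.abs(pred_box - box)
    f_box = np.where(e < 1.0, 0.5 * e ** 2, e - 0.5).mean()
    # f_soft: orientation estimation term on the yaw angle.
    f_soft = (1.0 - np.cos(pred_yaw - yaw)).mean()
    return f_class + f_box + f_soft

# Usage with placeholder "predictions" for the M = 2 labelled objects.
pred_cls = np.full((2, 3), 1.0 / 3.0)     # uniform scores over 3 toy categories
loss = detection_loss(pred_cls, d_out[:, 1:7] + 0.1, d_out[:, 7], d_out)
print(f"frame with {len(d_in)} points, toy loss = {loss:.3f}")
```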
In our experiment, the case of N = 500 and K = 4 is simulated. It is assumed that the sensing datasets are generated and pre-stored at the vehicles before transmission. The channels $\{h_k, g_k\}$ are randomly generated according to $\mathcal{CN}(0, \varrho_k I_N)$ with a pathloss of $\varrho_k = -60$ dB, and the channels are independently regenerated in each FL iteration. Since an autonomous vehicle has a larger transmit power budget than an IoT device, we set $P = 30$ mW, and the noise powers remain the same as in the above subsections. The normalized MSE of the proposed AGP-based UM-AirComp is substantially smaller than that of the fixed beamforming (benchmark) scheme. The detection results are shown in Fig. 6 and the global bird's-eye view of this frame is shown in Fig. 7a. The comparison between the proposed and benchmark schemes is provided in Fig. 7b.

From the above results, it can be seen that the MSEs of both schemes are remarkably smaller than their counterparts in Fig. 4a. However, the learning performance in Fig. 6 and Fig. 7b is worse than that in Fig. 4b and Fig. 4c, which implies that object detection tasks in autonomous driving are more sensitive to model parameter errors. This is because the trained model parameters of the SECOND network are generally sparser, so even a slight model error can lead to completely different predictions. Furthermore, as seen from the objective of P, the model error of user $k$ under the UM-AirComp framework is dominated by $\gamma \sum_{j=1}^{K} | r_k g_k^H F h_j t_j - \alpha_j |^2$. Therefore, the methods to minimize the model errors can be categorized into 1) configuration of the wireless channels $\{h_k, g_k\}$ and 2) design of the analog beamformer $F$. For the configuration of wireless channels, due to $\alpha_1 = \cdots = \alpha_K$, the ideal wireless channels for the proposed UM-AirComp satisfy $g_1 = \cdots = g_K$ and $h_1 = \cdots = h_K$. This implies that UM-AirComp should be executed when vehicles are geographically close to each other, such that the magnitudes of $\{h_k, g_k\}$ are close. This is the case in vehicle platooning and vehicle parking scenarios [45]; otherwise, emerging techniques (e.g., reconfigurable intelligent surfaces (RIS) [5]) should be adopted to smartly alter the wireless environment. On the other hand, for the design of the beamformer, the key is to align the various channels $\{h_k, g_k\}$ to the same direction and power for decoding the superposed signals. This is the case in Fig. 7b, where the proposed method achieves a much higher average precision than the fixed beamforming scheme for all vehicles. Note that the runtime of the AGP-based UM-AirComp is on the order of a second and can be further accelerated via a dedicated GPU.
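To illustrate how the dominant error term $\gamma \sum_{j=1}^{K} | r_k g_k^H F h_j t_j - \alpha_j |^2$ behaves, the sketch below evaluates it for randomly drawn channels and a random unit-modulus analog beamformer; the small dimensions, the unit transmit/receive scalings $t_j$ and $r_k$, and the omission of pathloss are simplifying assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 8, 4                        # antennas and users (small for illustration)
gamma = 1.0
alpha = np.full(K, 1.0 / K)        # identical aggregation weights alpha_j

# Rayleigh uplink/downlink channels h_j, g_k ~ CN(0, I_N), pathloss omitted.
H = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)
G = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)

# Unit-modulus analog beamformer: every entry has the same magnitude.
F = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, (N, N))) / np.sqrt(N)

t = np.ones(K, dtype=complex)      # transmit scaling of each user (assumed)
r = np.ones(K, dtype=complex)      # receive scaling at each user (assumed)

def model_error(k):
    """gamma * sum_j | r_k g_k^H F h_j t_j - alpha_j |^2 for user k."""
    return gamma * sum(
        abs(r[k] * (G[k].conj() @ F @ H[j]) * t[j] - alpha[j]) ** 2
        for j in range(K)
    )

print([round(model_error(k), 3) for k in range(K)])
```

With identical channels ($g_1 = \cdots = g_K$, $h_1 = \cdots = h_K$) and suitable scalings, every summand can be driven to zero simultaneously, which is exactly the channel-alignment intuition discussed above.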
VII. CONCLUSION

This paper proposed the UM-AirComp scheme to support simultaneous transmission of local model parameters in edge federated learning systems. A training loss upper bound of UM-AirComp was derived, which reveals that the key to minimizing the FL training loss is to minimize the maximum MSE among all users. Two low-complexity large-scale optimization algorithms were proposed to tackle the nonconvex nonsmooth loss-bound minimization problem. The performance and runtime of the UM-AirComp framework with the proposed optimization algorithms were verified on regression and classification tasks. The proposed framework and algorithms were also verified in a V2X autonomous driving simulation platform, and the experimental results showed that the object detection precision of the proposed algorithm is significantly higher than that achieved by the benchmark schemes.

APPENDIX A: PROOF OF THEOREM

The proof consists of three steps. First, it is proved that $\mathbb{E}[\|\Delta x^{[i]}\|^2] \leq (3 + 2K) \max_k \mathrm{MSE}_k^{[i]}$, where $\Delta x^{[i]} = x_k^{[i+1]}(0) - \big[ x_k^{[i]}(0) - \varepsilon \nabla_x \Lambda(x_k^{[i]}(0)) \big]$. Then, using Assumption 1 and the Lipschitz conditions, the relationship between $\Lambda(x_k^{[i+1]}(0)) - \Lambda(x_k^{[i]}(0))$ and $\mathbb{E}[\|\Delta x^{[i]}\|^2]$ is obtained. Lastly, applying the former relationship recursively, the relationship between $\mathbb{E}[\Lambda(x_k^{[i+1]}(0)) - \Lambda(\theta^*)]$ and $\mathbb{E}[\|\Delta x^{[i]}\|^2]$ is obtained. The details are given below.
1) Bounding $\mathbb{E}[\|\Delta x^{[i]}\|^2]$: We first derive the following upper bound:

$\|\Delta x^{[i]}\| = \big\| x_k^{[i+1]}(0) - x_k^{[i]}(0) + \varepsilon \nabla_x \Lambda(x_k^{[i]}(0)) \big\|$
$= \big\| x_k^{[i+1]}(0) - \theta^{[i]} + \sum_{j=1}^{K} \alpha_j x_j^{[i]}(0) - \frac{\varepsilon}{\sum_{j=1}^{K}|\mathcal{D}_j|} \sum_{j=1}^{K} \sum_{d_l \in \mathcal{D}_j} \nabla_x \Theta(d_l, x_j^{[i]}(0)) - x_k^{[i]}(0) + \varepsilon \nabla_x \Lambda(x_k^{[i]}(0)) \big\|$
$\leq \big\| x_k^{[i+1]}(0) - \theta^{[i]} \big\| + \sum_{j=1}^{K} \alpha_j \big\| x_j^{[i]}(0) - x_k^{[i]}(0) \big\| + \frac{\varepsilon}{\sum_{j=1}^{K}|\mathcal{D}_j|} \sum_{j=1}^{K} \Big\| \sum_{d_l \in \mathcal{D}_j} \nabla_x \Theta(d_l, x_j^{[i]}(0)) - \sum_{d_l \in \mathcal{D}_j} \nabla_x \Theta(d_l, x_k^{[i]}(0)) \Big\|,$ (50)

where the second equality is obtained from (2) with $E = 1$ and (6), and the inequality follows from the triangle inequality $\|a_1 + a_2\| \leq \|a_1\| + \|a_2\|$. On the other hand, according to (11), we have

$\mathbb{E}\big[ \| x_k^{[i+1]}(0) - \theta^{[i]} \|^2 \big] = \mathrm{MSE}_k^{[i]},$
$\mathbb{E}\big[ \| x_k^{[i]}(0) - x_j^{[i]}(0) \|^2 \big] = \mathbb{E}\big[ \| x_k^{[i]}(0) - \theta^{[i-1]} + \theta^{[i-1]} - x_j^{[i]}(0) \|^2 \big] \leq 4 \max_k \mathrm{MSE}_k^{[i]}.$ (51)

Moreover, according to Assumption 1, we have

$\mathbb{E}\Big[ \Big\| \sum_{d_l \in \mathcal{D}_j} \nabla_x \Theta(d_l, x_j^{[i]}(0)) - \sum_{d_l \in \mathcal{D}_j} \nabla_x \Theta(d_l, x_k^{[i]}(0)) \Big\|^2 \Big] \leq \mathbb{E}\big[ L^2 \| x_k^{[i]}(0) - x_j^{[i]}(0) \|^2 \big] \leq 4 L^2 \max_k \mathrm{MSE}_k^{[i]}.$ (52)

Putting (51) and (52) into (50), and using the expression of $\varepsilon$, yields

$\mathbb{E}\big[ \|\Delta x^{[i]}\|^2 \big] \leq (3 + 2K) \max_{k=1,\cdots,K} \mathrm{MSE}_k^{[i]}.$ (53)
2) Bounding $\Lambda(x_k^{[i+1]}(0)) - \Lambda(x_k^{[i]}(0))$: Due to $\sum_{d_l \in \mathcal{D}_k} \nabla_x^2 \Theta(d_l, x) \preceq L\, I$, we have $\nabla_x^2 \Lambda(x) \preceq KL / (\sum_k |\mathcal{D}_k|)\, I$. Based on $\mu I \preceq \nabla_x^2 \Lambda(x) \preceq KL / (\sum_k |\mathcal{D}_k|)\, I$, the following inequalities hold [40]:

$\Lambda(x') \leq \Lambda(x) + (x' - x)^T \nabla \Lambda(x) + \frac{KL}{2 \sum_{k=1}^{K} |\mathcal{D}_k|} \| x' - x \|^2,$ (54a)
$\Lambda(x') \geq \Lambda(x) + (x' - x)^T \nabla \Lambda(x) + \frac{\mu}{2} \| x' - x \|^2.$ (54b)

Putting $x' = x_k^{[i+1]}(0) = x_k^{[i]}(0) - \varepsilon \nabla_x \Lambda(x_k^{[i]}(0)) + \Delta x^{[i]}$ and $x = x_k^{[i]}(0)$ into (54a), we have

$\Lambda(x_k^{[i+1]}(0)) \leq \Lambda(x_k^{[i]}(0)) - \frac{\sum_{j=1}^{K}|\mathcal{D}_j|}{2KL} \big\| \nabla_x \Lambda(x_k^{[i]}(0)) \big\|^2 + \frac{KL}{\sum_{j=1}^{K}|\mathcal{D}_j|} \|\Delta x^{[i]}\|^2.$ (55)

On the other hand, the right-hand side of (54b) is minimized at $x' = x - \mu^{-1} \nabla_x \Lambda(x)$. Putting this expression and $x = x_k^{[i]}(0)$ into (54b) gives

$\big\| \nabla_x \Lambda(x_k^{[i]}(0)) \big\|^2 \geq 2\mu \big[ \Lambda(x_k^{[i]}(0)) - \Lambda(\theta^*) \big].$ (56)

Combining (55) and (56) gives

$\Lambda(x_k^{[i+1]}(0)) - \Lambda(x_k^{[i]}(0)) \leq -\frac{\mu \sum_{j=1}^{K}|\mathcal{D}_j|}{KL} \big[ \Lambda(x_k^{[i]}(0)) - \Lambda(\theta^*) \big] + \frac{KL}{\sum_{j=1}^{K}|\mathcal{D}_j|} \|\Delta x^{[i]}\|^2.$ (57)
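As a numerical sanity check of the gradient lower bound (56), the following sketch verifies $\|\nabla \Lambda(x)\|^2 \geq 2\mu [\Lambda(x) - \Lambda(\theta^*)]$ on a toy strongly convex quadratic standing in for $\Lambda$; the dimension and Hessian are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
A = rng.standard_normal((n, n))
Q = A.T @ A + 0.5 * np.eye(n)         # Hessian, so mu = lam_min(Q) > 0
theta_star = rng.standard_normal(n)   # minimizer of the toy objective

def Lam(x):
    """Lambda(x) = 0.5 (x - theta*)^T Q (x - theta*), so Lambda(theta*) = 0."""
    d = x - theta_star
    return 0.5 * d @ Q @ d

def grad(x):
    return Q @ (x - theta_star)

mu = np.linalg.eigvalsh(Q)[0]
for _ in range(100):
    x = rng.standard_normal(n)
    # Inequality (56): ||grad Lambda(x)||^2 >= 2 mu [Lambda(x) - Lambda(theta*)].
    assert grad(x) @ grad(x) >= 2.0 * mu * Lam(x) - 1e-9
print("gradient lower bound (56) holds on all sampled points")
```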
3) Bounding $\mathbb{E}[\Lambda(x_k^{[i+1]}(0)) - \Lambda(\theta^*)]$: Applying (57) recursively leads to

$\Lambda(x_k^{[i+1]}(0)) - \Lambda(\theta^*) \leq \Big( 1 - \frac{\mu \sum_{j=1}^{K}|\mathcal{D}_j|}{KL} \Big)^{i+1} \big[ \Lambda(x_k^{[0]}(0)) - \Lambda(\theta^*) \big] + \sum_{i'=0}^{i} \frac{KL}{\sum_{j=1}^{K}|\mathcal{D}_j|} \Big( 1 - \frac{\mu \sum_{j=1}^{K}|\mathcal{D}_j|}{KL} \Big)^{i-i'} \|\Delta x^{[i']}\|^2.$ (58)

Taking expectations on both sides and applying (53), (58) becomes

$\mathbb{E}\big[ \Lambda(x_k^{[i+1]}(0)) - \Lambda(\theta^*) \big] \leq \Big( 1 - \frac{\mu \sum_{j=1}^{K}|\mathcal{D}_j|}{KL} \Big)^{i+1} \big[ \Lambda(x_k^{[0]}(0)) - \Lambda(\theta^*) \big] + \sum_{i'=0}^{i} A^{[i']} \max_k \mathrm{MSE}_k^{[i']}.$ (59)

Finally, taking the limit $i \to +\infty$ and using $\big( 1 - \mu \sum_{j=1}^{K}|\mathcal{D}_j| / (KL) \big)^{i+1} \to 0$, the proof is completed.

APPENDIX B: PROOF OF LEMMA

Suppose that $U(v')$ is the optimal solution to

$\min_{\|v\|^2 \leq \beta} \max_{b \in \Delta} g(v, v', b),$ (60)

where $g(v, v', b)$ is a function satisfying $g(v, v', b) \geq h(v, b)$, $g(v', v', b) = h(v', b)$, and $\nabla g(v', v', b) = \nabla h(v', b)$. Then, according to [35] and the properties of $g(v, v', b)$, every limit point of the sequence $(v^{(0)}, v^{(1)}, \cdots)$ generated by $v^{(n+1)} \leftarrow U(v^{(n)})$ from a feasible $v^{(0)}$ is a KKT solution to (34).

Specifically, define a surrogate function of $h(v, b)$ as

$g(v, v', b) = -\sum_{k=1}^{K} \frac{b_k |g_k^H v|^2}{\sigma_k^2} + \sum_{k=1}^{K} \frac{b_k (v - v')^H g_k g_k^H (v - v')}{\sigma_k^2}.$ (61)

It can be verified that $g(v, v', b) \geq h(v, b)$, $g(v', v', b) = h(v', b)$, and $\nabla g(v', v', b) = \nabla h(v', b)$. Therefore, to prove the lemma, it remains to show that $U(v')$ is the optimal solution to (60) with $g(v, v', b)$ given by (61). Applying the quasi-concave-convex property of (60) and the general minimax theorem [36], we have

$\min_{\|v\|^2 \leq \beta} \max_{b \in \Delta} g(v, v', b) = \max_{b \in \Delta} \min_{\|v\|^2 \leq \beta} g(v, v', b).$ (62)

Via the Lagrange multiplier method, it can be derived that

$\arg\min_{\|v\|^2 \leq \beta} g(v, v', b) = \frac{\sqrt{\beta}\, C(v') b}{\| C(v') b \|}.$ (63)

Putting the above result into $g(v, v', b)$, we have

$g\Big( \frac{\sqrt{\beta}\, C(v') b}{\| C(v') b \|}, v', b \Big) = -\Phi(v', b).$ (64)

Therefore, the optimal solution of $b$ to (62) is $\arg\min_{b \in \Delta} \Phi(v', b)$. Putting $\arg\min_{b \in \Delta} \Phi(v', b)$ into (63), the optimal solution of $v$ to (62) (and thus to (60)) is $U(v')$, which completes the proof.
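The two properties of the surrogate (61) that the proof relies on, majorization $g(v, v', b) \geq h(v, b)$ and tightness at $v = v'$, can be checked numerically. In the sketch below, $h(v, b) = -\sum_k b_k |g_k^H v|^2 / \sigma_k^2$ as implied by (61), while the dimensions and random draws are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 6, 3
g = (rng.standard_normal((K, N)) + 1j * rng.standard_normal((K, N))) / np.sqrt(2)
sigma2 = np.ones(K)                 # noise powers (assumed equal)
b = np.full(K, 1.0 / K)             # a point on the simplex Delta

def h(v):
    return -sum(b[k] * abs(g[k].conj() @ v) ** 2 / sigma2[k] for k in range(K))

def surrogate(v, v0):
    """g(v, v0, b) from (61): h(v) plus a nonnegative quadratic in (v - v0)."""
    d = v - v0
    quad = sum(b[k] * abs(g[k].conj() @ d) ** 2 / sigma2[k] for k in range(K))
    return h(v) + quad

v0 = rng.standard_normal(N) + 1j * rng.standard_normal(N)
for _ in range(100):
    v = rng.standard_normal(N) + 1j * rng.standard_normal(N)
    assert surrogate(v, v0) >= h(v) - 1e-9     # majorization: g >= h
assert abs(surrogate(v0, v0) - h(v0)) < 1e-9   # tightness at v = v0
print("surrogate majorizes h and is tight at v0")
```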
APPENDIX C: PROOF OF LEMMA

Lemma 4 ([40, Lemma 1.2.2]): If $h(x)$ is convex and twice differentiable, then $h(x)$ is Lipschitz smooth with constant $L$ if and only if $\nabla^2 h(x) \preceq L\, I$.

Based on Lemma 4, and since $\Xi(b)$ is convex and twice differentiable, it suffices to show $\nabla^2 \Xi(b) \preceq L_\Xi(\phi)\, I_K$. In particular, according to (41), the Hessian matrix of $\Xi(b)$ is

$\nabla^2 \Xi(b) = \frac{2\sqrt{\beta}\, \mathrm{Re}(C^H C)}{\sqrt{\phi + \|Cb\|^2}} - \frac{2\sqrt{\beta}}{(\phi + \|Cb\|^2)^{3/2}}\, \mathrm{Re}(C^H C b) \big[ \mathrm{Re}(C^H C b) \big]^H.$

Due to $\mathrm{Re}(C^H C b) [\mathrm{Re}(C^H C b)]^H \succeq 0$, we can drop the last term to bound $\nabla^2 \Xi(b)$ from above, which leads to

$\nabla^2 \Xi(b) \preceq \frac{2\sqrt{\beta}}{\sqrt{\phi + \|Cb\|^2}}\, \mathrm{Re}(C^H C) \preceq \frac{2\sqrt{\beta}\, \lambda_{\max}[\mathrm{Re}(C^H C)]}{\sqrt{\phi + \|Cb\|^2}}\, I_K,$ (65)

where the second inequality follows from $\mathrm{Re}(C^H C) \preceq \lambda_{\max}[\mathrm{Re}(C^H C)]\, I_K$.

Now, the only quantity in (65) that depends on $b$ is $\|Cb\|^2$. To get rid of this dependence, $\|Cb\|^2$ is lower bounded by

$\|Cb\|^2 = b^T C^H C b \geq \lambda_{\min}(C^H C)\, \|b\|^2.$ (66)

Finally, using $\|b\|^2 \geq 1/K$ due to $b \in \Delta$ and the Cauchy–Schwarz inequality further leads to

$\|Cb\|^2 \geq \lambda_{\min}(C^H C)/K.$ (67)

Replacing $\|Cb\|^2$ in (65) with the right-hand side of (67), we immediately obtain $\nabla^2 \Xi(b) \preceq L_\Xi(\phi)\, I_K$.
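The smoothness argument of this appendix can also be checked numerically. The sketch below forms the Hessian of the smooth term $2\sqrt{\beta}\sqrt{\phi + \|Cb\|^2}$ of $\Xi(b)$ (any affine part of $\Xi$ leaves the Hessian unchanged) and verifies that its largest eigenvalue never exceeds the constant built from (65)–(67) on random simplex points; the matrix $C$ and all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
K, beta, phi = 4, 1.0, 0.1
C = (rng.standard_normal((K, K)) + 1j * rng.standard_normal((K, K))) / np.sqrt(2)
R = np.real(C.conj().T @ C)              # Re(C^H C), real symmetric PSD

lam = np.linalg.eigvalsh(R)
L_xi = 2.0 * np.sqrt(beta) * lam[-1] / np.sqrt(phi + lam[0] / K)

def hessian(b):
    """Hessian of 2 sqrt(beta) sqrt(phi + ||Cb||^2) for real b (Appendix C)."""
    s = phi + b @ R @ b                  # phi + ||Cb||^2, since b is real
    u = R @ b                            # Re(C^H C b)
    return 2.0 * np.sqrt(beta) * (R / np.sqrt(s) - np.outer(u, u) / s ** 1.5)

for _ in range(100):
    b = rng.dirichlet(np.ones(K))        # random point on the simplex
    assert np.linalg.eigvalsh(hessian(b))[-1] <= L_xi + 1e-9
print(f"L_Xi(phi) = {L_xi:.3f} upper-bounds the Hessian spectrum")
```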
APPENDIX D: PROOF OF THEOREM

To prove part (i), notice that $\mathrm{Rank}(C) = \mathrm{Rank}([g_1, \cdots, g_K])$ due to the definition of $C$ in (37). As $\mathrm{Rank}(C^H C) = \mathrm{Rank}(C)$, we have $\mathrm{Rank}(C^H C) = K$ and $\lambda_{\min}(C^H C) > 0$. Putting $\lambda_{\min}(C^H C) > 0$ and $\phi = 0$ into (43), we obtain

$L_\Xi(0) = \frac{2\sqrt{\beta}\, \big\| \mathrm{Re}(C^H C) \big\|}{\sqrt{\lambda_{\min}(C^H C)/K}} < +\infty.$ (68)

Next, to prove part (ii), it can be seen that $\lambda_{\min}(C^H C) = 0$ if $\mathrm{Rank}([g_1, \cdots, g_K]) < K$. Putting this result and $\phi = 0$ into (43), we obtain $L_\Xi(0) = +\infty$. On the other hand, if $\phi > 0$, we must have $\sqrt{\phi + \|Cb\|^2} > 0$ due to $\|Cb\|^2 \geq 0$. Putting this result into (43), we obtain $L_\Xi(\phi) < +\infty$ if $\phi > 0$.
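Parts (i) and (ii) can be observed numerically: with linearly independent channels $g_1, \cdots, g_K$, $\lambda_{\min}(C^H C) > 0$ and $L_\Xi(0)$ in (68) is finite, while duplicating a channel drives $\lambda_{\min}$ to zero so that $L_\Xi(0)$ blows up unless $\phi > 0$. In the sketch, $C$ is simply taken to be $[g_1, \cdots, g_K]$ for illustration.

```python
import numpy as np

def L_xi(C, beta, phi, K):
    """Smoothness constant L_Xi(phi) following (43)/(68)."""
    CHC = C.conj().T @ C
    lam_min = max(float(np.linalg.eigvalsh(CHC)[0]), 0.0)  # clip tiny negatives
    lam_max = float(np.linalg.eigvalsh(np.real(CHC))[-1])  # of Re(C^H C)
    denom = float(np.sqrt(phi + lam_min / K))
    return float("inf") if denom == 0.0 else 2.0 * np.sqrt(beta) * lam_max / denom

rng = np.random.default_rng(6)
N, K, beta = 16, 4, 1.0
G = (rng.standard_normal((N, K)) + 1j * rng.standard_normal((N, K))) / np.sqrt(2)

print(L_xi(G, beta, 0.0, K))       # part (i): finite (full column rank)

G_bad = G.copy()
G_bad[:, 1] = G_bad[:, 0]          # duplicated channel: rank < K, lam_min ~ 0
print(L_xi(G_bad, beta, 0.0, K))   # part (ii): blows up (infinite exactly)
print(L_xi(G_bad, beta, 0.1, K))   # part (ii): finite once phi > 0
```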
Finally, to prove part (iii) of this theorem, we need the following lemma.

Lemma 5: $\Xi(b)$ is bounded as $\Phi(b) \leq \Xi(b) \leq \Phi(b) + 2\sqrt{\beta \phi}$.

Proof: To prove the left inequality, notice that $\sqrt{\|Cb\|^2} \leq \sqrt{\phi + \|Cb\|^2}$. Putting this result into $\Xi(b)$ in (41), we immediately obtain $\Phi(b) \leq \Xi(b)$. On the other hand, to prove the right inequality, we first compute

$\Xi(b) - \Phi(b) = 2\sqrt{\beta} \Big( \sqrt{\phi + \|Cb\|^2} - \sqrt{\|Cb\|^2} \Big).$ (69)

Then, applying the identity $\sqrt{\phi + x} - \sqrt{x} = \frac{\phi}{\sqrt{\phi + x} + \sqrt{x}} \leq \sqrt{\phi}$, where the inequality is due to the monotonically decreasing nature of $\phi / (\sqrt{\phi + x} + \sqrt{x})$ with respect to $x$, the right-hand side of (69) is upper bounded as

$2\sqrt{\beta} \Big( \sqrt{\phi + \|Cb\|^2} - \sqrt{\|Cb\|^2} \Big) \leq 2\sqrt{\beta \phi}.$ (70)

Putting this result into (69), we have $\Xi(b) - \Phi(b) \leq 2\sqrt{\beta \phi}$.

Now, if an $\epsilon$-optimal solution $b' \in \Delta$ to $Q$ is obtained with $\Xi(b') - \Xi(b^\diamond) \leq \epsilon$, then we must have

$\Phi(b') \leq \Xi(b^\diamond) + \epsilon,$ (71)

due to $\Phi(b) \leq \Xi(b)$ from the first inequality of Lemma 5. On the other hand, taking the minimum on both sides of the second inequality of Lemma 5, we have

$\min_{b \in \Delta} \Xi(b) \leq \min_{b \in \Delta} \Phi(b) + 2\sqrt{\beta \phi}.$ (72)

Putting $\min_{b \in \Delta} \Xi(b) = \Xi(b^\diamond)$ and $\min_{b \in \Delta} \Phi(b) = \Phi(b^*)$ into (72), (72) becomes $\Xi(b^\diamond) \leq \Phi(b^*) + 2\sqrt{\beta \phi}$. Combining this result with (71) leads to $\Phi(b') \leq \Phi(b^*) + 2\sqrt{\beta \phi} + \epsilon$, and part (iii) of this theorem is proved.
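Lemma 5 depends only on the difference (69), so the sandwich bound can be checked with the smoothed and unsmoothed terms alone: in the sketch below, the functions keep only the $2\sqrt{\beta}\|Cb\|$ and $2\sqrt{\beta}\sqrt{\phi + \|Cb\|^2}$ parts of $\Phi$ and $\Xi$ (any common remainder cancels in (69)), and $C$, $\beta$, $\phi$ are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
K, beta, phi = 4, 2.0, 0.05
C = (rng.standard_normal((K, K)) + 1j * rng.standard_normal((K, K))) / np.sqrt(2)

def phi_core(b):
    """Non-smooth part of Phi(b): 2 sqrt(beta) ||Cb||."""
    return 2.0 * np.sqrt(beta) * np.linalg.norm(C @ b)

def xi_core(b):
    """Smoothed counterpart in Xi(b): 2 sqrt(beta) sqrt(phi + ||Cb||^2)."""
    return 2.0 * np.sqrt(beta) * np.sqrt(phi + np.linalg.norm(C @ b) ** 2)

gap = 2.0 * np.sqrt(beta * phi)    # Lemma 5: Phi <= Xi <= Phi + 2 sqrt(beta phi)
for _ in range(1000):
    b = rng.dirichlet(np.ones(K))
    assert phi_core(b) <= xi_core(b) <= phi_core(b) + gap + 1e-12
print(f"sandwich bound holds with gap 2*sqrt(beta*phi) = {gap:.4f}")
```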
REFERENCES

[1] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, pp. 436–444, May 2015.
[2] W. Saad, M. Bennis, and M. Chen, "A vision of 6G wireless systems: Applications, trends, technologies, and open research problems," IEEE Netw., vol. 34, no. 3, pp. 134–142, May/Jun. 2020.
[3] S. Yu, X. Chen, L. Yang, D. Wu, M. Bennis, and J. Zhang, "Intelligent edge: Leveraging deep imitation learning for mobile edge computation offloading," IEEE Wireless Commun., vol. 27, no. 1, pp. 92–99, Feb. 2020.
[4] S. Wang, Y.-C. Wu, M. Xia, R. Wang, and H. V. Poor, "Machine intelligence at the edge with learning centric power allocation," IEEE Trans. Wireless Commun., vol. 19, no. 11, pp. 7293–7308, Nov. 2020.
[5] S. Huang, S. Wang, R. Wang, M. Wen, and K. Huang, "Reconfigurable intelligent surface assisted mobile edge computing with heterogeneous learning tasks," IEEE Trans. Cognitive Commun. Networking, early access, 2021. DOI: 10.1109/TCCN.2021.3056707.
[6] Google Research, [Online]. Available: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html.
[7] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proc. AISTATS, Fort Lauderdale, FL, Apr. 2017.
[8] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, "Adaptive federated learning in resource constrained edge computing systems," IEEE J. Sel. Areas Commun., vol. 37, no. 6, pp. 1205–1221, Jun. 2019.
[9] M. Chen, Z. Yang, W. Saad, C. Yin, H. V. Poor, and S. Cui, "A joint learning and communications framework for federated learning over wireless networks," IEEE Trans. Wireless Commun., vol. 20, no. 1, pp. 269–283, Jan. 2021.
[10] T. Zeng, O. Semiari, M. Mozaffari, M. Chen, W. Saad, and M. Bennis, "Federated learning in the sky: Joint power allocation and scheduling with UAV swarms," in Proc. IEEE ICC, Dublin, Ireland, Jun. 2020.
[11] Y. Du, S. Yang, and K. Huang, "High-dimensional stochastic gradient quantization for communication-efficient edge learning," IEEE Trans. Signal Process., vol. 68, pp. 2128–2142, Mar. 2020.
[12] T. T. Vu, D. T. Ngo, N. H. Tran, H. Q. Ngo, M. N. Dao, and R. H. Middleton, "Cell-free massive MIMO for wireless federated learning," IEEE Trans. Wireless Commun., vol. 19, no. 10, pp. 6377–6392, Oct. 2020.
[13] G. Zhu and K. Huang, "MIMO over-the-air computation for high-mobility multimodal sensing," IEEE Internet of Things J., vol. 6, no. 4, pp. 6089–6103, Aug. 2019.
[14] M. M. Amiri and D. Gündüz, "Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air," IEEE Trans. Signal Process., vol. 68, pp. 2155–2169, Mar. 2020.
[15] G. Zhu, Y. Wang, and K. Huang, "Broadband analog aggregation for low-latency federated edge learning," IEEE Trans. Wireless Commun., vol. 19, no. 1, pp. 491–506, Jan. 2020.
[16] K. Yang, T. Jiang, Y. Shi, and Z. Ding, "Federated learning via over-the-air computation," IEEE Trans. Wireless Commun., vol. 19, no. 3, pp. 2022–2035, Mar. 2020.
[17] H. Guo, A. Liu, and V. K. N. Lau, "Analog gradient aggregation for federated learning over wireless networks: Customized design and convergence analysis," IEEE Internet of Things J., early access, 2020. DOI: 10.1109/JIOT.2020.3002925.
[18] G. Zhu, Y. Du, D. Gündüz, and K. Huang, "One-bit over-the-air aggregation for communication-efficient federated edge learning: Design and convergence analysis," IEEE Trans. Wireless Commun., early access, 2020. DOI: 10.1109/TWC.2020.3039309.
[19] F. Sohrabi and W. Yu, "Hybrid digital and analog beamforming design for large-scale antenna arrays," IEEE J. Sel. Topics Signal Process., vol. 10, no. 3, pp. 501–513, Apr. 2016.
[20] V. Venkateswaran and A. J. van der Veen, "Analog beamforming in MIMO communications with phase shift networks and online channel estimation," IEEE Trans. Signal Process., vol. 58, no. 8, pp. 4131–4143, Aug. 2010.
[21] Y. Yan, Y. Mao, and B. Li, "SECOND: Sparsely embedded convolutional detection," Sensors, vol. 18, no. 10, p. 3337, Oct. 2018.
[22] S. Shi, Z. Wang, J. Shi, X. Wang, and H. Li, "From points to parts: 3D object detection from point cloud with part-aware and part-aggregation network," IEEE Trans. Pattern Anal. Mach. Intell., early access, 2020. DOI: 10.1109/TPAMI.2020.2977026.
[23] R. Gudwin, E. Rohmer, A. Paraense, E. Froes, W. Gibaut, I. Oliveira, S. Rocha, K. Raizer, and A. V. Feljan, "The TROCA Project: An autonomous transportation robot controlled by a cognitive architecture," Cognitive Systems Research, vol. 59, pp. 179–197, Jan. 2020.
[24] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in Proc. 1st Annual Conference on Robot Learning, 2017, pp. 1–16.
[25] M. L. Psiaki, S. P. Powell, H. Jung, and P. M. Kintner, "Design and practical implementation of multifrequency RF front ends using direct RF sampling," IEEE Trans. Microw. Theory Techn., vol. 53, no. 10, pp. 3082–3089, Oct. 2005.
[26] M. P. Friedlander and M. Schmidt, "Hybrid deterministic-stochastic methods for data fitting," SIAM J. Sci. Comput., vol. 34, no. 3, pp. 1380–1405, May 2012.
[27] X. Zhang, Matrix Analysis and Applications. Beijing, China: Tsinghua Univ. Press, 2004.
[28] A. Beck, "On the convergence of alternating minimization for convex programming with applications to iteratively reweighted least squares and decomposition schemes," SIAM J. Optimiz., vol. 25, no. 1, pp. 185–209, 2015.
[29] Y. Wang, J. Yang, W. Yin, and Y. Zhang, "A new alternating minimization algorithm for total variation image reconstruction," SIAM J. Imaging Sciences, vol. 1, no. 3, pp. 248–272, 2008.
[30] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[31] B. R. Marks and G. P. Wright, "A general inner approximation algorithm for nonconvex mathematical programs," Operations Research, vol. 26, no. 4, Jul. 1978.
[32] A. Ben-Tal and A. Nemirovski, Lectures on Modern Convex Optimization (MPS/SIAM Series on Optimization). Philadelphia, PA, USA: SIAM, 2013.
[33] S. Wang, L. Cheng, M. Xia, and Y.-C. Wu, "Massive MIMO multicast beamforming via accelerated random coordinate descent," in Proc. IEEE ICASSP, Brighton, UK, May 2019, pp. 4494–4498.
[34] N. Parikh and S. Boyd, "Proximal algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, pp. 127–239, 2014.
[35] Y. Sun, P. Babu, and D. P. Palomar, "Majorization-minimization algorithms in signal processing, communications, and machine learning," IEEE Trans. Signal Process., vol. 65, no. 3, pp. 794–816, Feb. 2017.
[36] M. Sion, "On general minimax theorems," Pacific J. Math., vol. 8, no. 1, pp. 171–176, 1958.
[37] Y. Nesterov, "Smooth minimization of non-smooth functions," Math. Program., vol. 103, no. 1, pp. 127–152, May 2005.
[38] W. Su, S. Boyd, and E. J. Candes, "A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights," J. Mach. Learn. Res., vol. 17, no. 153, pp. 1–43, Sep. 2016.
[39] A. Beck and M. Teboulle, "A fast iterative shrinkage-thresholding algorithm for linear inverse problems," SIAM J. Imaging Sci., vol. 2, no. 1, pp. 183–202, Mar. 2009.
[40] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course (Applied Optimization). Springer, 2004.
[41] L. Condat, "Fast projection onto the simplex and the ℓ1 ball," Math. Program., vol. 158, no. 1, pp. 575–585, Jul. 2016.
[42] T. E. Booth, "Power iteration method for the several largest eigenvalues and eigenfunctions," Nucl. Sci. Eng., vol. 154, no. 1, pp. 48–62, 2006.
[43] A. Goldsmith, Wireless Communications. Cambridge, U.K.: Cambridge Univ. Press, 2005.
[44] L. Deng, "The MNIST database of handwritten digit images for machine learning research," IEEE Signal Process. Mag., vol. 29, no. 6, pp. 141–142, Nov. 2012.
[45] A. V. Feljan and Y. Jin, "A simulation framework for validating cellular V2X scenarios," in