Over-The-Air Computation in Correlated Channels
aa r X i v : . [ c s . I T ] J u l Over-The-Air Computation for Distributed MachineLearning
Matthias Frey ∗ , Igor Bjelakovi´c † and Sławomir Sta ´nczak ∗†∗ Technische Universit¨at Berlin, Germany † Fraunhofer Heinrich Hertz Institute, Berlin, Germany
Abstract —Motivated by various applications in distributedMachine Learning (ML) in massive wireless sensor networks,this paper addresses the problem of computing approximatevalues of functions over the wireless channel and providesexamples of applications of our results to distributed trainingand ML-based prediction. The “over-the-air” computation of afunction of data captured at different wireless devices has ahuge potential for reducing the communication cost, which isneeded for example for training of ML models. It is of particularinterest to massive wireless scenarios because, as shown in thispaper, its communication cost for training scales more favorablewith the number of devices than that of traditional schemes thatreconstruct all the data. We consider noisy fast-fading channelsthat pose major challenges to the “over-the-air” computation. Asa result, function values are approximated from superimposednoisy signals transmitted by different devices. The fading andnoise processes are not limited to Gaussian distributions, andare assumed to be in the more general class of sub-gaussiandistributions. Our result does not assume necessarily independentfading and noise, thus allowing for correlations over time andbetween devices.
I. I
NTRODUCTION
Machine learning (ML) models are increasingly trained ondata collected by wireless sensor networks, with the goal toperform ML tasks in these networks such as prediction andclassification. ML has undeniably a great potential, but it is notavailable for free. The benefits of ML must be set in relation tothe effort and resources required. Since radio communicationresources (spectrum and energy) are generally scarce, thereis a strong interest [2] in resource-saving methods that wouldallow ML models to be efficiently trained and used in resource-constrained wireless networks. In fact, great efforts have beenmade in recent years to reduce the communication overheadfor training and deploying ML models.A common approach [3], [4] to the problem of resourcescarcity is to compress the distributed data before transmissionor to fuse it efficiently on its way through the network. Theadvantage of compression methods is undoubtedly that theyare application-independent and can basically be used in anycommunication network. However, compression methods canbe highly sub-optimal in noisy wireless channels [5] andcannot fully exploit specific requirements of the underlyingapplication for further resource savings [6]. Fusion-driven
Part of this work was presented in [1] at the 57th Annual AllertonConference on Communication, Control, and Computing, Sept. 24-27, 2019,Allerton Park and Retreat Center, Monticello, IL, USA.This work was supported by the German Research Foundation (DFG)within their priority programs SPP 1798 “Compressed Sensing in InformationProcessing” and SPP 1914 ”Cyber-Physical Networking”, as well as undergrant STA 864/7.This work was partly supported by a Nokia University Donation. Wegratefully acknowledge the support of NVIDIA Corporation with the donationof the DGX-1 used for this research. routing [7] combines compression with network layer opti-mization, but is also application-agnostic and only applies tomulti-hop networks.However, in Internet of Things (IoT) scenarios with a mas-sive number of low-cost, low-energy wireless devices, a radicalimprovement in spectral and energy efficiency is of utmostimportance. In fact, when it comes to data collection frommassively-deployed IoT devices, the scaling laws for capacityand energy are relevant and need to be improved, otherwisethe system performance may be severely degraded [8]. Suchimprovements can be achieved by abandoning the philoso-phy of strictly separating the process of communication andapplication-specific computation. This applies in particularwhen ML models are trained in wireless (sensor) networks,because the training requires only the computation of somefunctions of sensor data – the data of the individual sensors,in contrast, do not need to be decoded. Indeed, if the objectivefor a receiver is to compute a function f : R K → R of some K variables rather than to fully reconstruct all the individualvariables, then any strategy based on creating independentchannels is in general suboptimal even under optimal idealisticconditions (optimal strategies with no transmission errors).This can be immediately concluded from the data processinginequality which implies that no receiver-side processing of asignal can increase the information contained in the signal [9,Section 2.3]. This means that the entropy or the amount ofinformation contained in f ( s , . . . , s K ) , where s , . . . , s K arerandom variables, is smaller than or equal to the amountof information contained in the random vector ( s , . . . , s K ) .In many cases of practical interest, the information loss issignificant, as illustrated by the following example. Example 1.
Suppose that K nodes send their data s , . . . , s n to a single receiver over a multiple access channel. Forsimplicity, we assume that each s k is an independent randomvariable uniformly distributed over S = { , } . Now if thereceiver reconstructs each of these variables, then the entropyor the amount of information available at the receiver is P Kk =1 H ( s k ) = K bits where H : S → R ≥ : s P s ∈S p ( s ) log (1 /p ( s )) and p : S 7→ [0 , is the probabilitymass function. This means that the nodes have to transmitto the receiver K bits. Therefore, if the capacity of thecommunication channel is bit per channel use, then K channel uses are necessary to convey the full information to thereceiver. Now we assume that the receiver is only interestedin f ( s , . . . , s K ) = P k =1 s k which can be easily computedfrom the s k s. By the data processing inequality, this operation In the case of orthogonal channel access, it is necessary to establish K independent (interference-free) communication channels, each of whichhaving the capacity of bit per channel use. cannot increase the amount of information. In fact, the entropyof the function is H ( P k x k ) = K − P Kk =0 (cid:0) Kk (cid:1) − K log (cid:0) Kk (cid:1) which is strictly smaller than K for all K ≥ . This means thatinstead of transmitting K bits that are necessary to reconstructeach x k , the agents can send significantly less information tothe receiver if its objective is to compute the sum of the x k s. Of course, these observations are known and have alreadybeen used to improve the performance of wireless systems(see also Section I-A). In particular, it has been shown [6]that in some cases the spectral efficiency is much higher if thereceiver directly reconstructs the necessary function instead ofdecoding all the transmitted messages and then computing thefunction. Of particular importance is that, depending on thefunction to be reconstructed, a more favorable scaling law forcapacity can be achieved than with traditional separation-basedapproaches [6].Against this background, in this paper, we advocateapplication-specific schemes that take into account the un-derlying task, such as computation of a function, directlyat the physical layer. The key ingredients that open up thedoor to a paradigm shift in the design of such schemesare provided by a fundamental result [10]–[13] stating thatevery function has a nomographic representation . These rep-resentations allow us to exploit the superposition propertyof the wireless channel for an efficient function computationover the “air”. The superposition (or broadcast) property isthe ability of the wireless channel to “form” (noisy) linearcombinations of information-bearing symbols transmitted bydifferent nodes. This property is usually seen as a problemin traditional wireless networks where it is the source forinterference between independent concurrent transmissions. Ifin contrast, nodes cooperate for a common goal which is afunction computation, then the superposition property is ingeneral not a source for interference but rather for “inference” which should be exploited to compute functions. Whether thesuperposition property can be exploited for performance gainsstrongly depends on errors and uncertainties introduced by thewireless channel. There are at least two main sources of errorsand uncertainty on the communication side that need to beproperly addressed to pave the way for practical applications ofthe ideas:
Noise and fading . For general functions, the impactof noise can severely deteriorate the system performance.Therefore, in this paper, we focus on the class F mon consistingof bounded and measurable nomographic functions, where theouter function is restricted in a way that allows for controllingthe impact of noise and fading introduced by the channel.We derive bounds on the number of channel uses needed toapproximate a function in F mon up to a desired accuracy anddeal with a channel model that may display a certain amountof correlation in noise and fading between users and in time. A. Prior Work
Analog uncoded approximation of functions first appearedin [14]. This work assumes known source distributions andadditive white Gaussian noise channels without fading for theachievability theorems and the class of approximated functionsis constrained to the linear case. The idea has been picked upin [15]–[17] and extended to a more general class of functions.These works consider the noiseless case as well as the case with noise, but without fast fading, providing asymptotic errorbounds.Distributed computation of functions with coding has beenintroduced in [6] with applications in network coding. Theoriginal idea has been refined, expanded and applied in severalmore recent works (e.g., [18]–[22]). These works focus onthe application of network coding and therefore the case inwhich the same function (an addition in a finite field) is tobe computed repeatedly. The coding approach used ensurescomputation of the discrete functions with an arbitrarily smallerror as long as the computation rate is not too high.Pre- and post-processing schemes for function approxi-mation over fast-fading channels appeared in [23], but notheoretical guarantees are derived. In [1], we derived suchtheoretical guarantees for the case of independent fading andnoise. The authors of [24] derived theoretical bounds onthe mean squared error in over-the-air function computationin a fast-fading scenario with channel state information atthe transmitter. The work also considers the case of multi-antenna transceivers and provides empirical results along withthe theoretical bounds. In [25], the over-the-air computationproblem is approached without explicit channel estimationunder the assumption of slow fading. The work also considersintersymbol interference and provides theoretical analysis aswell as numerical results.The direct application of over-the-air computation tech-niques to distributed gradient descent has received a lot ofattention recently since this can be used to solve the empiricalrisk minimization problem for ML models such as neural net-works in the case of distributed training data without having tocollect the training data at a central point. [26], [27] propose toextend the Federated Learning paradigm [28], [29] to make useof over-the-air computation over wireless channels and providetheoretical analysis along with empirical results to this end.There are also extensions of this idea to channels with fadingchannel information at either the transmitter or receiver [30]–[35], often taking additional aspects into consideration such asdifferential privacy [36] and multi-antenna scenarios [37].
B. Contributions of this Paper
Contrary to the prior work, we do not assume a particularsource distribution on the transmitted messages, but deriveuniform bounds that hold independently of how the sourcesare distributed. We also focus on one-shot approximation offunction values, which is in contrast to the scenario in whichthe same function is computed repeatedly, as is the case inthe works that focus on the application of network coding.Our channel model is more general than in prior works, sincewe do not only consider sub-gaussian fast fading and noise,but also allow for limited correlations between them. Forthe ML applications that we present, we focus on generalscenarios in cases for which we can provide theoreticallyproven performance guarantees, while existing works deal withmore complex, but also less general applications to ML.To summarize, our main contributions in this work are1) a detailed technical analysis of a method of distributedapproximation of functions in F mon in a multiple-accesssetting with fast fading and additive noise accountingfor correlations between users as well as in time, ××× ...... H K H H + + Nx K x x Y Fig. 1. Channel model.
2) the treatment of sub-gaussian fading and noise, gener-alizing the Gaussian case so as to accommodate manyfading and noise distributions that occur in practice,3) applications of these techniques to a subclass of MLmodels for distributed prediction in regression problemsin Section III and an approach to distributed trainingand prediction in a model-agnostic approach based onBoosting in Section IV.II. S
YSTEM M ODEL AND P ROBLEM S TATEMENT
A. Sub-Gaussian Random Variables
We begin with a short overview of the relevant definitionsand properties of sub-gaussian random variables. More on thistopic can be found in Appendix A and in [38]–[40].For a random variable X , we define τ ( X ) := inf n t > ∀ λ ∈ RE exp ( λ ( X − E X )) ≤ exp (cid:0) λ t / (cid:1) o . (1) X is called a sub-gaussian random variable if τ ( X ) < ∞ . Thefunction τ ( · ) defines a semi-norm on the set of sub-gaussianrandom variables [38, Theorem 1.1.2], i.e., it is absolutelyhomogeneous, satisfies the triangle inequality, and is non-negative. τ ( X ) = 0 does not necessarily imply X = 0 unlesswe identify random variables that are equal almost everywhere.Examples of sub-gaussian random variables include Gaussianand bounded random variables. B. System Model
We consider the following channel model with K transmittersand one receiver, depicted in Fig. 1: For m = 1 , . . . , M , thechannel output at the m -th channel use is given by Y ( m ) = K X k =1 H k ( m ) x k ( m ) + N ( m ) . (2)Here and hereafter, the notation is defined as follows: Note that other norms on the space of sub-gaussian random variables thatappear in the literature are equivalent to τ ( · ) (see, e.g., [38]). The particulardefinition we choose here matters, however, because we want to derive resultsin which no unspecified constants appear. E MK E M E M M -foldchannel D M ...... s K s s x MK x M x M Y M ˜ f Fig. 2. System model. • x k ( m ) ∈ C is a transmit symbol. We assume a peakpower constraint | x k ( m ) | ≤ P for k = 1 , . . . , K and m = 1 , . . . , M . • H k ( m ) , k = 1 . . . , K , m = 1 , . . . , M , are complex-valued random variables such that for every m =1 , . . . , M and k = 1 , . . . , K , the real part H rk ( m ) andthe imaginary part H ik ( m ) of H k ( m ) are sub-gaussianrandom variables with mean zero and variance . • N ( m ) , m = 1 , . . . , M , are complex-valued randomvariables. We assume that the real and imaginary parts N r ( m ) , N i ( m ) of N ( m ) are sub-gaussian random vari-ables with mean zero for m = 1 , . . . , M . Definition 1.
We say that the fading is user-uncorrelated iffor every k = k , j ∈ { i, r } and m , the random variables H jk ( m ) and H jk ( m ) are independent. We allow limited dependence in the fading and noise. Moreprecisely, we allow the fading coefficients and additive noiseinstances to be linear combinations of the entries of a commonrandom base vector with independent, sub-gaussian entries. Inorder to be able to apply a variation of the Hanson-Wrightinequality as a tool, we give the formal description of ourdependency model in terms of matrices and vectors with realentries.We define H := ( H (1) , . . . , H (2 M )) T (3)where for m = 1 , . . . , M , H (2 m −
1) := ( H r ( m ) , . . . , H rK ( m )) H (2 m ) := ( H i ( m ) , . . . , H iK ( m )) . So H is the vector of all fading coefficients. Similarly, let N := ( N r (1) , N i (1) , . . . , N r ( M ) , N i ( M )) T (4)be the vector of all the instances of additive noise. Thedependence model we consider is such that there is a vector R of (2 KM + 2 M ) independent random variables with sub-gaussian norm at most and matrices A ∈ R KM × (2 KM +2 M ) and B ∈ R M × (2 KM +2 M ) such that H = AR and N = BR . Remark 1.
While the class of fading and noise distributionsdefined by this does not contain arbitrarily dependent sub-gaussian fading and noise, it does contain arbitrarily depen-dent Gaussian fading and noise as a special case.
Remark 2.
Obviously, user-uncorrelated fading can be char- acterized based on the form of A . If we write A = A (1) ... A (2 M ) , where for all m , A ( m ) ∈ R K × (2 MK +2 M ) , then H = AR defines user-uncorrelated fading for all R iff each A ( m ) has atmost one nonzero entry per column. This is because H ( m ) = A ( m ) R and therefore two or more nonzero entries in a columnof A ( m ) mean that two or more entries in H ( m ) depend onthe same entry in R .C. Distributed Approximation of Functions Our goal is to approximate functions f : S × . . . ×S K → R ina distributed setting. The sets S , . . . S K ⊆ R are assumed tobe closed and endowed with their natural Borel σ -algebras B ( S ) , . . . , B ( S K ) , and we consider the product σ -algebra B ( S ) ⊗ . . . ⊗ B ( S K ) on the set S × . . . × S K . Furthermore,the functions f : S × . . . × S K → R under consideration areassumed to be measurable in what follows. Definition 2.
An admissible distributed function approxima-tion scheme (DFA) for f : S × . . . ×S K → R with M channeluses, depicted in Fig. 2, is a pair ( E M , D M ) , consisting of: A pre-processing function E M = ( E M , . . . , E MK ) , whereeach E Mk is of the form E Mk ( s k ) = ( x k ( m, s k , U k ( m ))) Mm =1 ∈ C M with random variables U k (1) , . . . , U k ( M ) and a mea-surable map ( s k , t , . . . , t M ) ( x k ( m, s k , t m )) Mm =1 ∈ C M . The encoder E Mk is subject to the peak power constraint | x k ( m, s k , U k ( m )) | ≤ P for all k = 1 , . . . , K and m = 1 , . . . , M . A post-processing function D M : The receiver is allowedto apply a measurable recovery function D M : C M → R upon observing the output of the channel. So in order to approximate f , the transmitters apply theirpre-processing maps to ( s , . . . , s K ) ∈ S × . . . ×S K resultingin E M ( s ) , . . . , E MK ( s K ) which are sent to the receiver usingthe channel M times. The receiver observes the output of thechannel and applies the recovery map D M . The whole processdefines an estimate ˜ f of f .Let ε > , δ ∈ (0 , and f : S × . . . × S K → R be given.We say that f is ε -approximated after M channel uses withconfidence level δ if there is a DFA ( E M , D M ) such that theresulting estimate ˜ f of f satisfies P ( | ˜ f ( s K ) − f ( s K ) | ≥ ε ) ≤ δ (5)for all s K := ( s , . . . , s K ) ∈ S × . . . × S K . Let M ( f, ε, δ ) denote the smallest number of channel uses such that there isan approximation scheme ( E M , D M ) for f satisfying (5). Wecall M ( f, ε, δ ) the communication cost for approximating afunction f with accuracy ε and confidence δ . D. The class of functions to be approximated
A measurable function f : S × . . . × S K → R is calleda generalized linear function if there are bounded measurablefunctions ( f k ) k ∈{ ,...,K } , with f ( s , . . . , s K ) = P Kk =1 f k ( s k ) , for all ( s , . . . , s K ) ∈ S × . . . × S K . The set of generalizedlinear functions from S × . . . × S K → R is denoted by F K, lin .Our main object of interest will be the following class offunctions. Definition 3.
A measurable function f : S × . . . ×S K → R issaid to belong to F mon if there exist bounded and measurablefunctions ( f k ) k ∈{ ,...,K } , a measurable set D ⊆ R with theproperty f ( S ) + . . . + f K ( S K ) ⊆ D , a measurable function F : D → R such that for all ( s , . . . , s K ) ∈ S × . . . × S K we have f ( s , . . . , s K ) = F K X k =1 f k ( s k ) ! , (6) and there is a strictly increasing function Φ : [0 , ∞ ) → [0 , ∞ ) with Φ(0) = 0 and | F ( x ) − F ( y ) | ≤ Φ( | x − y | ) (7) for all x, y ∈ D . We call the function Φ an increment majorant of f . Some examples of functions in F mon are:1) Obviously, all f ∈ F K, lin belong to F mon .2) For any f ∈ F K, lin and B -Lipschitz function F : R → R we have F ◦ f ∈ F mon with Φ : [0 , ∞ ) → [0 , ∞ ) , x Bx .3) If f ∈ F K, lin and F is ( C, α ) -H¨older continuous, i.e.,for all x, y in the domain of F , (cid:12)(cid:12) F ( x ) − F ( y ) (cid:12)(cid:12) ≤ C (cid:12)(cid:12) x − y (cid:12)(cid:12) α , then F ◦ f ∈ F mon with Φ : x Cx α .4) For any p ≥ and S , . . . , S K compact, || · || p ∈ F mon .In this example we have f k ( s k ) = | s k | p , k = 1 , . . . , K , F : [0 , ∞ ) → [0 , ∞ ) , x x p , and F = Φ .This can be seen as follows. We have to show that forall nonnegative x, y ∈ R and p ≥ we have | x p − y p | ≤ | x − y | p . (8)We can assume w.l.o.g. that x < y holds. Then since | x p − y p | = | y | p − (cid:18) xy (cid:19) p ! it suffices to prove that for all a ∈ [0 , and p ≥ wehave − a p ≤ (1 − a ) p . Now since a p + (1 − a ) p ≥ a +(1 − a ) = 1 for a ∈ [0 , and p ≥ , we can concludethat (8) holds.We are now in a position to state our main theorem onapproximation of functions in F mon . To this end, we introducethe notion of total spread of the inner part of f ∈ F mon as ¯∆( f ) := K X k =1 ( φ max ,k − φ min ,k ) , (9)along with the max -spread ∆( f ) := max ≤ k ≤ K ( φ max ,k − φ min ,k ) , (10)where φ min ,k := inf s ∈S k f k ( s ) , φ max ,k := sup s ∈S k f k ( s ) . (11)We define the relative spread with power constraint P as ∆( f k P ) := P · ¯∆( f )∆( f ) . (12) We use k·k and k·k F to denote the operator and Frobeniusnorm of matrices, respectively. Theorem 1.
Let f ∈ F mon , M ∈ N , and the power constraint P ∈ R + be given. Let Φ be an increment majorant of f .Assume the fading and noise are correlated as determinedby matrices A and B . Let A i ∈ R MK × (2 MK +2 M ) bea matrix which generates user-uncorrelated fading and let A U ∈ R (2 MK +2 M ) × (2 MK +2 M ) be a unitary matrix thatapproximate A in the sense that k ( A + A i A U )( A − A i A U ) T k ≤ η. Then there exist pre- and post-processing operations such that P (cid:0)(cid:12)(cid:12) ¯ f − f ( s , . . . , s K ) (cid:12)(cid:12) ≥ ε (cid:1) ≤ (cid:18) − M Φ − ( ε ) F + D + 4Φ − ( ε ) L (cid:19) + 2 exp (cid:18) − M Φ − ( ε ) F + 32Φ − ( ε ) L (cid:19) , (13) where L = q ¯∆( f ) k A k + r ∆( f ) P k B k ! F = L r ¯∆( f ) M k A k F + r ∆( f ) P M k B k F ! D = (cid:18) √ M ¯∆( f ) η + 4 ∆( f ) √ P M k AB T k F (cid:19) . Remark 3.
If no suitable approximation for A of the form A i A U is available, we can always choose A i := 0 and A U := id , which results in η = k A k . Corollary 1.
In the setting of Theorem 1 with uncorrelatedfading and noise, i.e., A := (cid:0) σ F id MK (cid:1) , B := (cid:0) σ N id M (cid:1) , we have P (cid:0)(cid:12)(cid:12) ¯ f − f ( s , . . . , s K ) (cid:12)(cid:12) ≥ ε (cid:1) ≤ (cid:18) − M Φ − ( ε ) F ′ + 4Φ − ( ε ) L ′ (cid:19) + 2 exp (cid:18) − M Φ − ( ε ) F ′ + 32Φ − ( ε ) L ′ (cid:19) , where L ′ = q ¯∆( f ) σ F + r ∆( f ) P σ N ! F ′ = L ′ q K ¯∆( f ) σ F + r f ) P σ N ! . Proof.
Note that AB T = 0 , k A k = σ F , k B k = σ N , k A k F = √ M Kσ F and k B k F = √ M σ N ; pick A i := A and A U := id and substitute this into (13). Corollary 2.
For the approximation communication cost, wehave M ( f, ε, δ ) ≤ log 4 − log δ Φ − ( ε ) Γ , (14) where Γ := max (cid:0) F + D + 4Φ − ( ε ) L, F + 32Φ − ( ε ) L (cid:1) . Proof.
We upper bound (13) as P ( | ¯ f ( s K ) − f ( s K ) | ≥ ε ) ≤ (cid:18) − M Φ − ( ε ) Γ (cid:19) , and solve the expression for M concluding the proof. Remark 4. If F is C -Lipschitz continuous, we can replace Φ − ( ε ) in (14) and the expression for Γ with ε/C . III. D
ISTRIBUTED F UNCTION A PPROXIMATION IN
MLIn this section, we discuss how the methods described in thispaper can be used to compute estimators of support vectormachines (SVM) in a distributed fashion. First, we brieflysketch the setting as in [41]. We consider a feature alphabet X , a label alphabet Y ⊆ R and a probability distribution P on X × Y which is in general unknown. A statistical inferenceproblem is characterized by the feature alphabet, the labelalphabet and a loss function L : X × Y × R → [0 , ∞ ) . Theobjective is, given training samples drawn i.i.d. from P , tofind an estimator function f : X → R such that the risk R L, P := E P L ( X, Y, f ( X )) is as small as possible. In orderfor the risk to exist, we must impose suitable measurabilityconditions on L and f . In this paper, we deal with Lipschitz-continuous losses. We say that the loss L is B -Lipschitz-continuous if L ( x, y, · ) is Lipschitz-continuous for all x ∈ X and y ∈ Y with a Lipschitz constant uniformly boundedby B . Lipschitz-continuity of a loss function is a propertythat is also often needed in other contexts. Fortunately, manyloss functions of practical interest possess this property. Forinstance, the absolute distance loss, the logistic loss, the Huberloss and the ε -insensitive loss, all of which are commonlyused in regression problems [41, Section 2.4], are Lipschitz-continuous. Even in scenarios in which the naturally arisingloss is not Lipschitz-continuous, for the purpose of designingthe ML model, it is often replaced with a Lipschitz-continuousalternative. For instance, in binary classification, we have Y = {− , } and the loss function is given by ( x, y, t ) ( , sign( y ) = sign( t )1 , otherwise.This loss is not even continuous, which makes it hard to dealwith. So for the purpose of designing the ML model, it iscommonly replaced with the Lipschitz-continuous hinge lossor logistic loss [41, Section 2.3].Here, we consider the case in which the features are K -tuples and the SVM can be trained in a centralized fashion.The actual predictions, however, are performed in a distributedsetting; i.e., there are K users each of which observes onlyone component of the features. The objective is to make anestimate of the label available at the receiver while using aslittle communication resources as possible.To this end, we consider the case of additive models whichis described in [42, Section 3.1]. We have X = X × · · · × X K and a kernel κ k : X k ×X k → R with an associated reproducing kernel Hilbert space H k of functions mapping from X k to R for each k ∈ { , . . . , K } . Then by [42, Theorem 2] κ : X × X → R , (( x , . . . , x K ) , ( x ′ , . . . , x ′ K )) κ ( x , x ′ ) + · · · + κ K ( x K , x ′ K ) (15)is a kernel and the associated reproducing kernel Hilbert spaceis H := { f + · · · + f K : f ∈ H , . . . , f K ∈ H K } . (16)So this model is appropriate whenever the function to beapproximated is expected to have an additive structure. Weknow [41, Theorem 5.5] that an SVM estimator has the form f ( x ) = N X n =1 α n κ ( x, x n ) , (17)where α , . . . , α N ∈ R and x , . . . x N ∈ X . In our additivemodel, this is f ( x , . . . , x k ) = K X k =1 f k ( x k ) , (18)where for each k , f k ( x k ) = N X n =1 α n κ k ( x k , x nk ) . (19)We now state a result for the distributed approximationof the estimator of such an additive model as an immediateconsequence of Theorem 1. Corollary 3.
Consider an additive ML model, i.e., we have anestimator of the form (18), and assume that L is a B -Lipschitz-continuous loss. Suppose further that all the f K have boundedrange such that the quantities ¯∆( f ) and ∆( f ) as defined in (9)and (10) exist and are finite. Let ε, δ > and M ≥ M ( f, ε, δ ) as defined in (14), where Φ := id and thus Φ − ( ε ) = ε .Then, given any x K = ( x , . . . , x K ) at the transmitters andany y ∈ Y , through M uses of the channel (2), the receivercan obtain an estimate ¯ f of f ( x K ) satisfying P ( (cid:12)(cid:12) L ( x K , y, ¯ f ) − L ( x K , y, f ( x K )) (cid:12)(cid:12) ≥ Bε ) ≤ δ. (20) Proof.
The Lipschitz continuity of L yields P ( (cid:12)(cid:12) L ( x K , y, ¯ f ) − L ( x K , y, f ( x K )) (cid:12)(cid:12) ≥ Bε ) ≤ P ( (cid:12)(cid:12) ¯ f − f ( x K ) (cid:12)(cid:12) ≥ ε ) , from which (20) follows by the definition of M ( f, ε, δ ) .We conclude this section with a brief discussion of thefeasibility of the condition that f , . . . , f K have boundedranges in the case of the additive SVM model discussed above.The coefficients α , . . . , α N are a result of the training stepand can therefore be considered constant, so all we need is thatthe ranges of κ , . . . , κ K are bounded. This heavily dependson X , . . . , X K and the choices of the kernels, but we remarkthat the boundedness criterion is satisfied in many cases ofinterest. The range of Gaussian kernels is always a subset of (0 , , and while other frequent choices such as exponential,polynomial and linear kernels can have arbitrarily large ranges,they are nonetheless continuous which means that as long asthe input alphabets are compact topological spaces (e.g., closedhyperrectangles or balls), the ranges are also compact, andtherefore bounded. IV. A PPLICATION TO B OOSTING
In this section, we discuss how boosting techniques can beused to apply results from this work to distributed predictionand training for binary classification problems. This is morespecific than the considerations in Section III in the sense thatwe concentrate on binary classification, but it is more generalin the sense that the approach discussed here does not assumean underlying additive model and that it works regardless ofwhat tools the nodes employ to make their local predictions.Therefore, each node can choose an ML model based, e.g., onits computational capabilities and the nature of the features itobserves.We consider a feature alphabet X = X × · · · × X K and alabel alphabet Y = {− , } as well as an unknown, but fixedprobability distribution P on X × Y . In the training phase,each user k observes S training samples (cid:16)(cid:16) x (1) k , y (1) (cid:17) , . . . , (cid:16) x ( S ) k , y ( S ) (cid:17)(cid:17) , where for all k, s , we have x ( s ) k ∈ X k , y ( s ) ∈ Y and ( x ( s )1 , . . . , x ( s ) K , y ( s ) ) is drawn according to P .Each user k can train its own model based on its locallyavailable training sample which is drawn from the marginal of P with respect to X k × Y . We propose to use a slight variationof the well-known boosting technique and define a classifier g := K X k =1 α k g k , (21)where g k is the base classifier locally trained at user k and α k is a nonnegative weight. As an immediate corollary toTheorem 1 parallel to Corollary 3, g can be approximated ata central node in a distributed manner. Corollary 4.
Assume that L is a B -Lipschitz-continuous loss.Let ε, δ > and M ≥ M ( g, ε, δ ) as defined in (14), where Φ − ( ε ) = ε , noting that ¯∆( g ) = 2 K X k =1 α k , ∆( g ) = 2 K max k =1 α k . Then, given any x K = ( x , . . . , x K ) at the transmitters andany y ∈ Y , through M uses of the channel (2), the receivercan obtain an estimate ¯ g of g ( x K ) satisfying P ( (cid:12)(cid:12) L ( x K , y, ¯ g ) − L ( x K , y, g ( x K )) (cid:12)(cid:12) ≥ Bε ) ≤ δ. (22)The proof is the same as for Corollary 3.This is a relatively generic framework that can in principlework with any particular boosting technique which determinesweights α , . . . , α K and guarantees a bound on the loss ofthe predictor g dependent on the errors of the base classifiers g , . . . , g K . Of course, there are some problems specific to theactual boosting technique employed which we have not yetconsidered. Firstly, the algorithm that determines α , . . . , α K and possibly also modifications to the local training procedures(e.g., a reweighting of the training samples in the empiricaldistribution used) is usually designed to run centrally andadopting it to the distributed setting can incur significantcommunication cost. Secondly, the predictor g can only beapproximated at the receiver up to a residual error (whichcan, however, be controlled) and thus, a guarantee in terms of the - -loss is not sufficient to apply Corollary 4. Instead, weneed it to be in terms of a Lipschitz-continuous loss.In the remainder of this section, we provide an example ofhow to address these points in the case of the often employedAdaBoost algorithm. To this end, we adapt the standardscheme as in [43, Figure 6.1] to the distributed setting. Thealgorithm runs through T ≤ K iterations, choosing a user h t at iteration t to provide a base classifier g h t and assigning acorresponding weight α h t . It also computes probability dis-tributions D , . . . , D T +1 on the index set of the training data { , . . . , S } , initializing D as the uniform distribution, as wellas base classifier errors ǫ , . . . , ǫ T and normalization constants Z , . . . , Z T . Each iteration t consists of the following steps:1) The central node chooses a user h t and broadcasts thechoice.2) User h t trains a base classifier g h t : X h t → {− , } onthe training sample with distribution D t and broadcaststhe indices of the training samples incorrectly classifiedby g h t .3) From this information, every node in the system com-putes the following: • ǫ t := P Ss =1 D t ( s ) g ht ( x ( s ) ht ) = y ( s ) • α h t := log − ǫ t ǫ t • Z t := 2 p ǫ t (1 − ǫ t )) • D t +1 ( s ) := D t ( s ) exp( − α h t g h t ( x ( s ) h t ) y ( s ) ) /Z t The resulting classifier is then as defined in (21), where weassign α k := 0 whenever k = h t for all t . [43, Theorem 6.1]guarantees that the empirical - -loss of g is at most exp − T X t =1 (cid:18) − ǫ t (cid:19) ! , (23)which unfortunately is insufficient to apply Corollary 4, be-cause the - -loss is not Lipschitz-continuous. However, theproof of the theorem relies only on the inequality g ( x K ) y ≤ ≤ exp( − g ( x K ) y ) for the instantaneous - -loss. Since the in-equality log(1 + exp( − g ( x K ) y )) ≤ exp( − g ( x K ) y ) alsoclearly holds, we can replace the - -loss in the proof withthe logistic loss L ( x K , y, ˆ y ) := log(1+exp( − y ˆ y )) (or, indeed,any other loss which satisfies this inequality). 
This yields thesame bound (23) on the -Lipschitz-continuous logistic lossand thus we can apply Corollary 4 with B := 1 to derive aguarantee on the logistic loss of the distributed approximationof our AdaBoost classifier.We conclude with some remarks on the distributed training.The choice in step 1 could, e.g., be predetermined (in whichcase no communication in this step is necessary) or random,but we could also greedily select the classifier with smallesterror using an instance of ScalableMax [44][1, Section IV].As for the communication cost of the distributed training,step 1 exhibits a favorable scaling which is linear in T and logarithmic in K , however, step 2 has a cost linearin the number of training samples. There is a conceptuallysimpler alternative to this distributed scheme in which wecommunicate the full training set to the central node andperform the training in a centralized manner. The advantagein communication cost of the distributed scheme over thiscentralized alternative is only a constant factor. On the otherhand, since only one bit per training sample and user is transmitted, this constant gain could potentially be quite large,depending on the complexity of the feature spaces. Also, in thedistributed training scheme, the computational load of trainingthe base classifiers is distributed across all nodes which mayin practice also be an advantage wherever the computationalcapabilities of the central node are limited.V. P ROOF OF T HEOREM A. Pre-Processing
In the pre-processing step we encode the function values f k ( s k ) , k = 1 , . . . , K as transmit power: X k ( m ) := √ a k U k ( m ) , ≤ m ≤ M with a k = g k ( f k ( s k )) , where g k : [ φ min ,k , φ max ,k ] → [0 , P ] such that g k ( t ) := P ∆( f ) ( t − φ min ,k ) , (24)where ∆( f ) is given in (10) and φ min ,k is defined in (11). U k ( m ) , k = 1 , . . . , K , m = 1 , . . . , M are i.i.d. with theuniform distribution on {− , +1 } . We assume the randomvariables U k ( m ) , k = 1 , . . . , K , m = 1 . . . , M , are indepen-dent of H k ( m ) , k = 1 , . . . , K , m = 1 , . . . , M , and N ( m ) , m = 1 , . . . , M .We write the vector of transmitted signals at channel use m as X ( m ) := ( X ( m ) , . . . , X K ( m )) and combine them in a matrix as Q := X (1) 0 0 0 0 . . . X (1) 0 0 0 . . .
00 0 X (2) 0 0 . . .
00 0 0 X (2) 0 . . . . . . . . . X ( M ) 00 0 0 . . . X ( M ) . B. Post-Processing
The vector Y of received signals across the M channel usescan be written as Y = Q · H + N, where H and N are givenin (3) and (4). The post-processing is based on receive energywhich has the form k Y k = Y T Y = ( QAR + BR ) T ( QAR + BR ) = R T CR, where we use C := ( QA + B ) T ( QA + B )= A T Q T QA + A T Q T B + B T QA + B T B. (25)Equivalently, we can phrase this as k Y k = K X k =1 a k k H k k + ¯ N s K , (26)where H k = ( H k (1) , . . . , H k ( M )) is a vector consisting ofcomplex fading coefficients, and ¯ N s K = P Mm =1 ¯ N s K ( m ) . The random variables ¯ N s K ( m ) , m = 1 , . . . , M , are given by ¯ N s K ( m ) := K X k,l =1 ,k = l √ a k a l H k ( m ) H l ( m ) × U k ( m ) U l ( m )+ 2 Re N ( m ) K X k =1 √ a k H k ( m ) U k ( m ) ! + | N ( m ) | . (27)The receiver computes its estimate ¯ f of f ( s , . . . , s K ) as ¯ f := F (¯ g ( k Y k − E k N k )) , where ¯ g ( t ) := ∆( f )2 · M · P t + K X k =1 φ min ,k . C. The Error Event
Clearly, E ¯ N s K ( m ) = E | N ( m ) | (since all the other sum-mands in (27) are centered). We can therefore conclude E (cid:0) ¯ g (cid:0) k Y k − E k N k (cid:1)(cid:1) = ¯ g (cid:0) E k Y k − E k N k (cid:1) = K X k =1 f k ( s k ) . We use this to argue (cid:12)(cid:12) ¯ f − f ( s , . . . , s K ) (cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) F (cid:0) ¯ g (cid:0) k Y k − E k N k (cid:1)(cid:1) − F K X k =1 f k ( s k ) !(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ Φ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ¯ g (cid:0) k Y k − E k N k (cid:1) − K X k =1 f k ( s k ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)! = Φ (cid:0)(cid:12)(cid:12) ¯ g (cid:0) k Y k − E k N k (cid:1) − ¯ g (cid:0) E k Y k − E k N k (cid:1)(cid:12)(cid:12)(cid:1) = Φ (cid:18) ∆( f )2 M P (cid:12)(cid:12) k Y k − E k Y k (cid:12)(cid:12)(cid:19) and therefore P (cid:0)(cid:12)(cid:12) ¯ f − f ( s , . . . , s K ) (cid:12)(cid:12) ≥ ε (cid:1) ≤ P (cid:18)(cid:12)(cid:12) k Y k − E k Y k (cid:12)(cid:12) ≥ M P ∆( f ) Φ − ( ε ) (cid:19) . (28) D. Performance Bounds
Our objective is now to establish the concentration of k Y k around its expectation and thus obtain an upper bound for theright hand side of (28). To this end, we first need to establisha series of lemmas that we will use as tools.We will split the deviation from the mean into a diagonaland an off-diagonal part. The first lemma will later help usbound the diagonal part of the error. Lemma 1.
Let X , . . . , X n be independent and centered withsubgaussian norm at most . Let A , . . . , A n be random vari-ables independent of X , . . . , X n but not necessarily of eachother, and assume that for all k , | A k | ≤ ˜ L and P nk =1 A k ≤ ˜ F almost surely. Then we have for any c ∈ (0 , and any λ ∈ ( − c/ (2 ˜ L ) , c/ (2 ˜ L )) , E exp λ n X k =1 (cid:0) A k X k − E ( A k X k ) (cid:1)! ≤ exp λ · F − c ! E exp λ n X k =1 (cid:0) A k − E ( A k ) (cid:1)! . Proof.
The lemma follows by a straightforward calculation E exp λ n X k =1 (cid:0) A k X k − E ( A k X k ) (cid:1)! = E exp λ n X k =1 (cid:0) A k ( X k − E ( X k )) (cid:1)! · exp λ n X k =1 (cid:0) E ( X k )( A k − E ( A k ) (cid:1)! ! (29) = E A exp λ n X k =1 (cid:0) E ( X k )( A k − E ( A k ) (cid:1)! · n Y k =1 E X exp (cid:0) ( λA k ) (cid:0) X k − E ( X k ) (cid:1)(cid:1) ! (30) ≤ E A exp λ n X k =1 (cid:0) E ( X k )( A k − E ( A k ) (cid:1)! · n Y k =1 exp (cid:18) λ · A k − c (cid:19) ! (31) ≤ exp λ · F − c ! E exp λ n X k =1 ( A k − E A k ) !! , (32)where (30) follows by the independence assumptions, (31)is an application of Lemma 7 and (32) holds because P nk =1 A k ≤ ˜ F almost surely.The next lemma is a slight variation of the Hanson-Wrightinequality as phrased in [40, Theorem 6.2.1] and will help usbound the off-diagonal part of the error. Lemma 2.
Let X be an R n -valued random variable withindependent, centered entries and assume that for all k ∈{ , . . . , n } , the k -th entry of X satisfies τ ( X k ) ≤ K . Let A ∈ R n × n with zeros on the diagonal and ε > . Suppose furtherthat k A k ≤ A op and k A k F ≤ A F . Then E (cid:0) X T AX (cid:1) = 0 and P (cid:0)(cid:12)(cid:12) X T AX (cid:12)(cid:12) ≥ ε (cid:1) ≤ (cid:18) − ε K εA op + 256 K A (cid:19) . (33)This lemma differs from [40, Theorem 6.2.1] mainly in thatwe require the diagonal entries of A to be and that all theconstants are explicit. The proof follows [40] closely and isgiven in Appendix B. We remark that it is not hard to followthe proof in [40] further and expand the result to matrices withnon-zero diagonal elements, however, this is not relevant forthe present work.Mainly because the matrix C contains randomness, we needa slight modification of this lemma as well as two morelemmas exploring some specific properties of C . Corollary 5.
Assume a setting as in Lemma 2, but let A bean R n × n -valued random variable independent of X such thatalmost surely, the diagonal entries of A are , k A k ≤ A op and k A k F ≤ A F . Then E (cid:0) X T AX (cid:1) = 0 and (33), consideringjoint expectation, respectively probability of X and A , stillhold.Proof. E (cid:0) X T AX (cid:1) = 0 as well as (33) hold conditional onany realization of A (except possibly in a null set) and there-fore, the Corollary follows by the laws of total expectationand total probability. Lemma 3.
We have almost surely k C k F ≤ (cid:16)p ∆( f || P ) k A k + k B k (cid:17) · (cid:16)p ∆( f || P ) k A k F + k B k F (cid:17) k C k ≤ (cid:16)p ∆( f || P ) k A k + k B k (cid:17) . Proof.
In order to bound the norm of C , we first note that QQ T = K X k =1 a k id M . (34)Therefore, we can conclude that all singular values of Q arebounded by p ∆( f || P ) and thus k Q k ≤ p ∆( f || P ) .Noting that k XY k F ≤ k X kk Y k F for all matrices X, Y ofcompatible dimensions and further noting the submultiplica-tivity of the operator norm and the triangle inequality of bothnorms, we get k C k F ≤ k QA + B kk QA + B k F ≤ ( k Q kk A k + k B k ) ( k Q kk A k F + k B k F ) ≤ (cid:16)p ∆( f || P ) k A k + k B k (cid:17) · (cid:16)p ∆( f || P ) k A k F + k B k F (cid:17) k C k = k QA + B k ≤ ( k Q kk A k + k B k ) ≤ (cid:16)p ∆( f || P ) k A k + k B k (cid:17) Lemma 4.
We have τ (tr C ) ≤ M ∆( f || P ) k ( A + A i A U )( A − A i A U ) T k + 2 √ P k AB T k F . (35) Proof.
With an addition of zero, we can rewrite tr (cid:0) A T Q T QA (cid:1) = tr (cid:0) A T Q T QA (cid:1) + tr (cid:0) A T Q T QA i A U (cid:1) − tr (cid:16) ( A i A U ) T Q T QA (cid:17) − tr (cid:16) ( A i A U ) T Q T QA i A U (cid:17) + tr (cid:16) ( A i A U ) T Q T QA i A U (cid:17) = tr (cid:16) ( A − A i A U ) T Q T Q ( A + A i A U ) (cid:17) + tr (cid:16) ( A i A U ) T Q T QA i A U (cid:17) and use this together with (25) to conclude tr C =tr (cid:16) ( A − A i A U ) T Q T Q ( A + A i A U ) (cid:17) + 2tr (cid:0) B T QA (cid:1) + tr (cid:16) ( A i A U ) T Q T QA i A U (cid:17) + tr (cid:0) B T B (cid:1) . (36) Next, we argue that the terms in the last line are almost surelyconstant. For tr( B T B ) this is immediately clear. Moreover,we have tr (cid:16) ( A i A U ) T Q T QA i A U (cid:17) = tr (cid:0) A iT Q T QA i A U A U T (cid:1) = k QA i k F . We note that as per Remark 2 and using correspondingnotation, we have QA i = X (1) A (1) i . . . X (1) A (2) i . . . . . . . . . X ( M ) A (2 M − i . . . X ( M ) A (2 M ) i and because each A ( m ) i has only one nonzero entry percolumn, each entry of QA i is the product of U k ( m ) with adeterministic term for some m, k and therefore, its square cantake only one value almost surely, and consequently, k QA i k F also takes only one value almost surely.We can use this in (36) and incorporate the triangle inequal-ity to obtain τ (tr C ) ≤ τ ( ξ ) + 2 τ ( ξ ) , where ξ := tr (cid:16) ( A − A i A U ) T Q T Q ( A + A i A U ) (cid:17) ξ := tr (cid:0) B T QA (cid:1) . To the end of bounding τ ( ξ ) , we argue ξ = tr (cid:16) Q ( A + A i A U )( A − A i A U ) T Q T (cid:17) ≤ k ( A + A i A U )( A − A i A U ) T k tr (cid:0) QQ T (cid:1) ≤ M ∆( f || P ) k ( A + A i A U )( A − A i A U ) T k . The first inequality holds because for any square matrix X and compatible column vector v , we have v T ( k X k id − X ) v = k v k k X k − (cid:18) v k v k (cid:19) T X v k v k ! ≥ (see, e.g., [45, Exercise I.2.10]) and therefore k X k id − X ispositive semidefinite. The second inequality directly followsfrom (34). It follows, e.g., from [38, Example 1.2], that τ ( ξ ) is upper bounded by the first summand on the right hand sidein (35).In order to bound the sub-gaussian norm of ξ , we viewit as a function of ( U k ( m )) K,Mk,m =1 and use part of the proofof the Bounded Differences Inequality [46, Theorem 6.2] tobound the moment generating function. To this end, we define ( E i,j ) i ′ ,j ′ = ( , i ′ = i and j ′ = j , otherwise.and note that a change in the value of U k ( m ) changes thevalue of ξ by √ a k tr (cid:0) B T ( E m − ,K (2 m − k + E m,K (2 m − k ) A (cid:1) = 2 √ a k tr (cid:0) AB T ( E m − ,K (2 m − k + E m,K (2 m − k ) (cid:1) = 2 √ a k (cid:0) ( AB T ) K (2 m − k, m − + ( AB T ) K (2 m − k, m (cid:1) ≤ √ P (cid:0) ( AB T ) K (2 m − k, m − + ( AB T ) K (2 m − k, m (cid:1) Following the proof of the Bounded Differences Inequal-ity [46, Theorem 6.2], we can now conclude τ ( ξ ) ≤ K X k =1 M X m =1 (cid:16) √ P ( AB T ) (2 m − K + k, m − + 2 √ P ( AB T ) (2 m − K + k, m (cid:17) ≤ · · · P k AB T k F , concluding the proof of the lemma. Proof of Theorem 1.
What remains to be established is theconcentration of k Y k around its expectation. To this end, weobserve P (cid:0)(cid:12)(cid:12) k Y k − E k Y k (cid:12)(cid:12) ≥ ε (cid:1) = P (cid:0)(cid:12)(cid:12) R T CR − E ( R T CR ) (cid:12)(cid:12) ≥ ε (cid:1) ≤ P (cid:16) | Σ | ≥ ε (cid:17) + P (cid:16) | Σ | ≥ ε (cid:17) (37)where Σ := KM +2 M X i =1 (cid:0) R i C i,i − E (cid:0) R i C i,i (cid:1)(cid:1) Σ := KM +2 M X i,j =1 i = j R i R j C i,j . We use Lemma 1, Lemma 3 and Lemma 4 to bound themoment generating function of Σ as E exp( λ Σ ) ≤ exp λ F − c + ˜ F !! ≤ exp λ · F + ˜ F − c ! for any c ∈ (0 , and λ ∈ ( − c/ (2 ˜ L )) , c/ (2 ˜ L )) , where ˜ L := (cid:16)p ∆( f || P ) k A k + k B k (cid:17) ˜ F := ˜ L (cid:16)p ∆( f || P ) k A k F + k B k F (cid:17) ˜ F := (cid:16) M ∆( f || P ) k ( A + A i A U )( A − A i A U ) T k + 2 √ P k AB T k F (cid:17) By Lemma 9, this yields P (cid:16) | Σ | ≥ ε (cid:17) ≤ (cid:18) − (1 − c ) ε
64 ˜ F + 8 ˜ F (cid:19) in case < ε ≤ c − c · F + ˜ F ˜ L and P (cid:16) | Σ | ≥ ε (cid:17) ≤ (cid:18) − cε L (cid:19) otherwise. Since the first case term is increasing with c andthe second case term is decreasing, the optimal value for c iswhere the two cases meet, which is at c = ˜ Lε ˜ Lε + 8 ˜ F + ˜ F . mean norm DFA TDMA − −
10 0 10 20 30 40 5010 − − − noise power in dB m ea n s qu a r e d e rr o r K M
40 10 Fig. 3. Mean squared error of the approximation schemes dependent on thechannel noise power. − − − users m ea n s qu a r e d e rr o r noise M − dB dB Fig. 4. Mean squared error of the approximation schemes dependent on thenumber of participating transmitters.
Substituting this, we get P (cid:16) | Σ | ≥ ε (cid:17) ≤ (cid:18) − ε
64 ˜ F + 8 ˜ F + 8 ˜ Lε (cid:19) . Turning our attention to Σ , we note that by [47, Theorem2.1] the operator norm of the off-diagonal part of C can beupper bounded by k C k and thus by L . Therefore, we candirectly apply Lemma 2 and get P (cid:16) | Σ | ≥ ε (cid:17) ≤ (cid:18) − ε F + 64 ˜ Lε (cid:19) . Substituting these into (37) and using (28) concludes the proof.VI. N
UMERICAL R ESULTS
We have simulated the Distributed Function Approximation(DFA) scheme for Rayleigh fading channels with varying noise . . . . · − − − channel uses m ea n s qu a r e d e rr o r noise/dB K −
20 4030 2560
Fig. 5. Mean squared error of the approximation schemes dependent on thenumber of channel uses. power, number of users and amount of channel resources.The simulations were done for two different functions, withthe function arguments in both cases confined to the unitinterval [0 , , to highlight different aspects and properties ofthe scheme: The arithmetic mean function is linear and mapsonly to the interval [0 , (which means that no scheme canhave an error larger than ), while the Euclidian norm functionmaps to [0 , √ K ] and can show how the DFA scheme dealswith nonlinearities.We compare with a simple TDMA scheme, in which eachuser transmits separately in its designated slot, protecting theanalog transmission against channel noise in the same fashionas the DFA scheme, but not sharing the channel use with othertransmitters. In the case where the number of channel usesavailable is much larger than the number of users sharingthe resources, this form of a TDMA scheme is of coursehighly suboptimal, as the transmitters could use source andchannel coding to achieve a higher reliability. However, suchan approach is infeasible if the number of users is so high incomparison to the number of channel uses that only a few orpossibly even less than one channel use is available to eachuser, and in this work we are mainly interested in the scalingbehavior of our schemes in the number of users K . Therefore,this comparison provides an insight into the gain achieved byexploiting the superposition properties of the wireless channelwhile keeping in mind that for the regime of low K , thereare better coded schemes available. We also remark that theDFA scheme only needs coordination between the transmittersinsofar as all users need to transmit roughly at the sametime, while a TDMA scheme necessitates an allocation ofthe channel uses to the individual transmitters, which can becostly in the case of high K . The simulations carried outin this section do not consider this scheduling problem andassume for the TDMA scheme that the time slots have alreadybeen allocated, and this knowledge is available at both thetransmitters and the receiver. If M < K , there is not at leastone channel use available to each auser and the TDMA schemecan therefore not be carried out. We set the error in such casesto the maximum of or √ K , respectively. For the simulations, we assume a normalized peak tranmitterpower constraint of and channels with fading normalized toa variance of per complex dimension. The power of theadditive noise is given in dB per complex dimension and itsnegative can therefore be considered as the signal-to-noiseratio (SNR). Each plotted data point is based on an averageof simulation runs.The messages transmitted by the users are generated in thefollowing way: First, we draw a value µ , which is common toall transmitters, uniformly at random from [0 , . We then drawthe messages of all the users from a convex combination ofthe uniform distributions on [0 , µ ] and [ µ, where we choosethe weights in such a way that each message has expectation µ . The reason for choosing this procedure although the DFAscheme also performs well for more natural distributions suchas i.i.d. uniform in [0 , for all users is that in case ofmessages distributed according to a known i.i.d. 
distribution,the problem is too easy in the sense that both the mean andthe Euclidian norm concentrate around values that dependonly on the distribution and K , and therefore even withoutany communication at all, the function value can be quiteaccurately guessed if K is large. On the other hand, we intendthe DFA scheme for applications in which the messages can becorrelated and distributed according to unknown distributions,so we opt for this form of correlation between the messagesfor the sake of the numerical evaluation.In Fig. 3, we can see that the DFA scheme is at least asgood as the TDMA schemes for all the plotted data pointsand outperforms it in most cases, achieving a gain of upto dB for K = 2560 . We also see that for low powersof the additive noise, the effect of the multiplicative fadingdominates, and therefore, the error saturates as the additivenoise grows weaker. In Fig. 4, we can see that the DFA schemeperforms significantly better if the number of users is not toolow, which is due to the superposition of the signals in thewireless channel resulting in a combined signal strength thatgrows with the number of users. We can also see the TDMAscheme performing similarly to the DFA scheme for lownumbers of users, while quickly deteriorating in performanceor even becoming infeasible as their number grows. In Fig. 5,we can observe the exponential decay of the error as theamount of channel resources used increases. Once again, wecan observe that the TDMA scheme performs similarly to DFAfor a low number of users, but becomes infeasible for larger K . A PPENDIX
A. Preliminaries on Sub-Gaussian and Sub-Exponential Ran-dom Variables
We begin with a definition that is adapted from [40, Definition3.4.1]. For R n -valued random variables X , we define the sub-gaussian norm as τ ( X ) := inf n t : ∀ a ∈ S n − ∀ λ ∈ RE exp( λ h X, a i ) ≤ exp (cid:18) t λ (cid:19) o (38) and we observe that if all entries of X have a sub-gaussiannorm bounded by K and are independent, we have for any a ∈ S n − E exp ( λ h X, a i ) = E exp λ n X k =1 X k a k ! = n Y k =1 E exp ( λX k a k ) ≤ n Y k =1 exp (cid:18) K a k λ (cid:19) = exp (cid:18) K λ (cid:19) and therefore τ ( X ) ≤ K .In the following, we recall some basic definitions and resultsfrom [38, Chapter 1]. For a random variable X we define θ ( X ) := sup k ≥ (cid:18) E ( | X | k ) k ! (cid:19) k (39)If θ ( X ) < ∞ then X is called a sub-exponential randomvariable. θ ( · ) defines a semi-norm on the vector space ofsub-exponential random variables [38, Remark 1.3.2]. Typicalexamples of sub-exponential random variables are boundedrandom variables and random variables with exponential distri-bution. We collect some useful properties of and interrelationsbetween the sub-exponential and sub-gaussian norms in thefollowing lemma. Lemma 5.
Let
X, Y be random variables. Then: If X is N ( µ, σ ) then we have τ ( X ) = σ. (40)2) (Rotation Invariance) If X , . . . , X M are independent,sub-gaussian and centered, we have τ M X m =1 X m ! ≤ M X m =1 τ ( X m ) (41)3) If X is a random variable with | X | ≤ with probability and if Y is independent of X and sub-gaussian thenwe have τ ( X · Y ) ≤ τ ( Y ) . (42)4) If X and Y are sub-gaussian and centered, then X · Y is sub-exponential and θ ( X · Y ) ≤ · τ ( X ) · τ ( Y ) . (43)5) (Centering) If X is sub-exponential and X ≥ almostsurely, then θ ( X − E ( X )) ≤ θ ( X ) . (44) Proof. (40) follows in a straightforward fashion by calculatingthe moment generating function of X . (41) is e.g. proven in[38, Lemma 1.1.7]. (42) follows directly from the definitionconditioning on X . We show (43) first for X = Y . In this Note that as with our definition of the sub-gaussian norm, other norms onthe space of sub-exponential random variables that appear in the literature areequivalent to θ ( · ) (see, e.g., [38]). The particular definition we choose herematters, however, because we want to derive results in which no unspecifiedconstants appear. case, we have θ (cid:0) X (cid:1) = sup k ≥ (cid:18) E X k k ! (cid:19) k ≤ sup k ≥ k +1 k k τ ( X ) k e k k ! ! k = 2 τ ( X ) sup k ≥ k ke ( k !) k ! ≤ τ ( X ) , where the first inequality is by [38, Lemma 1.1.4] and thesecond follows from k k /k ! ≤ e k , which is straightforward toprove for k ≥ by induction. In the general case, we have θ ( XY ) = τ ( X ) τ ( Y ) θ (cid:18) XYτ ( X ) τ ( Y ) (cid:19) ≤ τ ( X ) τ ( Y ) θ (cid:18) Xτ ( X ) (cid:19) + 12 (cid:18) Yτ ( Y ) (cid:19) ! ≤ τ ( X ) τ ( Y ) , where the first inequality can be verified in (39), consideringthat ab ≤ a / b / for all a, b ∈ R , and the secondinequality follows from the triangle inequality and the specialcase X = Y .For (44), we assume without loss of generality E X = 1 (otherwise we can scale X ), and note that for all a ∈ [0 , ∞ ) and k ≥ , a k − | a − | k > a − and thus E ( X k − | X − | k ) ≥ E ( X −
1) = 0 . B. Proof of Lemma 2
The proof closely follows the proof of the Hanson-Wrightinequality in [40, Theorem 6.2.1]. We carry out the changesthat are necessary to arrive at explicit constants. To this end,we begin with some slightly modified versions of lemmas usedas ingredients in the proof of Bernstein’s inequality in [38,Theorem 1.5.2].
Lemma 6.
Let $X$ be a random variable with $E(X) = 0$ and $\theta(X) < +\infty$. For any $\lambda \in \mathbb{R}$ with $|\lambda\theta(X)| < 1$ we have
\[
E(\exp(\lambda X)) \leq 1 + |\lambda|^2\theta(X)^2 \cdot \frac{1}{1 - |\lambda\theta(X)|}.
\]
Proof.
Let $\lambda \in \mathbb{R}$ satisfy $|\lambda\theta(X)| < 1$. Then
\[
E(\exp(\lambda X))
= 1 + \sum_{k=2}^\infty \frac{\lambda^k E(X^k)}{k!}
\leq 1 + \sum_{k=2}^\infty \frac{|\lambda|^k E(|X|^k)}{k!}
\leq 1 + \sum_{k=2}^\infty |\lambda|^k \theta(X)^k
= 1 + |\lambda|^2\theta(X)^2 \sum_{k=0}^\infty |\lambda\theta(X)|^k
= 1 + |\lambda|^2\theta(X)^2 \cdot \frac{1}{1 - |\lambda\theta(X)|}, \tag{45}
\]
where in the last line we have used $|\lambda\theta(X)| < 1$.
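To see Lemma 6 in action, one can test it on a centered uniform variable, which is bounded and hence sub-exponential (a Monte Carlo sketch continuing the code above).

    # Continues the sketch above. U[-1, 1] is centered and bounded, so
    # theta(U) is finite (approximately 1/2, attained at k = 1).
    u = rng.uniform(-1.0, 1.0, 200_000)
    theta_u = theta_estimate(u)
    for lam in [0.1, 0.5, 1.0]:
        assert abs(lam * theta_u) < 1.0        # hypothesis of Lemma 6
        lhs = np.mean(np.exp(lam * u))
        rhs = 1.0 + (lam * theta_u)**2 / (1.0 - abs(lam * theta_u))
        print(lam, lhs <= rhs)                 # expected: True each time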
In the next lemma we derive an exponential bound depending on $\theta(X)$ on the moment generating function of the random variable $X$.

Lemma 7. Let $X$ be a random variable with $E(X) = 0$ and $\theta(X) < +\infty$. For any $c \in (0, 1)$ and $\lambda \in \left(-\frac{c}{\theta(X)}, \frac{c}{\theta(X)}\right)$ we have
\[
E(\exp(\lambda X)) \leq \exp\left(\frac{\lambda^2 \cdot \theta(X)^2}{1 - c}\right).
\]
Proof.
For $\lambda \in \left(-\frac{c}{\theta(X)}, \frac{c}{\theta(X)}\right)$ we have
\[
|\lambda\theta(X)| < c < 1, \tag{46}
\]
therefore by Lemma 6
\[
E(\exp(\lambda X))
\leq 1 + |\lambda|^2\theta(X)^2 \cdot \frac{1}{1 - |\lambda\theta(X)|}
\leq 1 + |\lambda|^2\theta(X)^2 \cdot \frac{1}{1 - c}
\leq \exp\left(\frac{\lambda^2 \cdot \theta(X)^2}{1 - c}\right),
\]
where in the second line we have used the first inequality in (46) and the last line is by the numerical inequality $1 + x \leq \exp(x)$, valid for $x \geq 0$.

Lemma 8.
Let $X_1, \dots, X_M$ be independent random variables with $E(X_i) = 0$ and $\theta(X_i) < +\infty$, $i = 1, \dots, M$. Let $L := \max_{1 \leq i \leq M} \theta(X_i)$, $c \in (0, 1)$, and $\lambda \in \left(-\frac{c}{L}, \frac{c}{L}\right)$. Then for $S_M := \sum_{i=1}^M X_i$ we have
\[
E(\exp(\lambda S_M)) \leq \exp\left(\frac{\lambda^2 \cdot \sum_{i=1}^M \theta(X_i)^2}{1 - c}\right). \tag{47}
\]
Proof.
By independence of $X_1, \dots, X_M$ we have
\[
E(\exp(\lambda S_M)) = \prod_{i=1}^M E(\exp(\lambda X_i)).
\]
Combining this with Lemma 7 proves the lemma.
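Lemma 8 can be checked in the same way on a sum of independent centered uniforms, where $\sum_i \theta(X_i)^2 = M\,\theta(U)^2$ (continuing the sketch above).

    # Continues the sketch above (theta_u as computed there).
    M, c = 20, 0.5
    s = rng.uniform(-1.0, 1.0, (200_000, M)).sum(axis=1)
    lam = 0.9 * c / theta_u           # inside (-c/L, c/L) with L = theta_u
    lhs = np.mean(np.exp(lam * s))
    rhs = np.exp(lam**2 * M * theta_u**2 / (1.0 - c))
    print(lhs <= rhs)                 # expected: True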
The next lemma establishes the basic tail bound for random variables satisfying inequalities of type (47). The proof can be found in [38, Lemma 1.4.1].

Lemma 9.
Let $X$ be a random variable with $E(X) = 0$. If there exist $\tau \geq 0$ and $\Lambda > 0$ such that
\[
E(\exp(\lambda X)) \leq \exp\left(\frac{\lambda^2 \tau^2}{2}\right)
\]
holds for all $\lambda \in (-\Lambda, \Lambda)$, then for any $t \geq 0$ we have $P(|X| \geq t) \leq 2 \cdot Q(t)$, where
\[
Q(t) =
\begin{cases}
\exp\left(-\frac{t^2}{2\tau^2}\right), & 0 < t \leq \Lambda\tau^2 \\
\exp\left(-\frac{\Lambda t}{2}\right), & \Lambda\tau^2 \leq t.
\end{cases}
\]
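The resulting two-regime bound is straightforward to evaluate. The sketch below (the helper name q_tail is ours) implements $2 \cdot Q(t)$ and checks it against empirical Gaussian tails; for $X \sim N(0, \sigma^2)$ the hypothesis holds with $\tau = \sigma$ and arbitrary $\Lambda > 0$, so $\Lambda$ can be taken large.

    # Continues the sketch above (numpy as np, rng in scope).
    def q_tail(t, tau, lam_cap):
        # Two-sided tail bound 2*Q(t) from Lemma 9, with lam_cap = Lambda.
        if t <= lam_cap * tau**2:
            return 2.0 * np.exp(-t**2 / (2.0 * tau**2))
        return 2.0 * np.exp(-lam_cap * t / 2.0)

    z = rng.normal(0.0, 1.0, 1_000_000)
    for t in [0.5, 1.0, 2.0, 3.0]:
        emp = np.mean(np.abs(z) >= t)
        print(t, emp <= q_tail(t, tau=1.0, lam_cap=10.0))  # expected: True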
The following lemma is a slightly modified version of [40, Lemma 6.2.3].

Lemma 10 (Comparison Lemma). Let $X$ and $X'$ be independent, $\mathbb{R}^n$-valued, centered and sub-gaussian random variables, and let $g, g'$ be independent and distributed according to $N(0, \mathrm{id}_n)$. Let $A \in \mathbb{R}^{n \times n}$ and $\lambda \in \mathbb{R}$. Then
\[
E\exp(\lambda X^T A X') \leq E\exp(\lambda \tau(X)\tau(X')\, g^T A g').
\]
Proof.
We first observe that for any $x \in \mathbb{R}^n$,
\[
E(\exp(\lambda \langle X, x \rangle))
= E\left(\exp\left(\lambda \|x\| \left\langle X, \frac{x}{\|x\|} \right\rangle\right)\right)
\leq \exp\left(\frac{\lambda^2 \|x\|^2 \tau(X)^2}{2}\right)
= E\left(\exp\left(\lambda \tau(X) \langle g, x \rangle\right)\right), \tag{48}
\]
where the inequality in (48) is by the definition of vector-valued sub-gaussian random variables and the equality is obtained by calculating the moment-generating function of $\langle g, x \rangle$. We can now conclude the proof from the following:
\begin{align}
E(\exp(\lambda X^T A X')) &= E_{X'}(E_X(\exp(\lambda \langle X, A X' \rangle))) \tag{49} \\
&\leq E_{X'}(E_g(\exp(\lambda \tau(X) \langle g, A X' \rangle))) \tag{50} \\
&= E_g(E_{X'}(\exp(\lambda \tau(X) \langle X', A^T g \rangle))) \tag{51} \\
&\leq E_g(E_{g'}(\exp(\lambda \tau(X) \tau(X') \langle g', A^T g \rangle))) \tag{52} \\
&= E(\exp(\lambda \tau(X) \tau(X')\, g^T A g')), \tag{53}
\end{align}
where (49), (51) and (53) are due to Fubini's theorem and elementary transformations, and (50) and (52) are both instances of the observation (48).
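The comparison can also be observed numerically. Vectors with i.i.d. Rademacher ($\pm 1$) entries satisfy $\tau(X) \leq 1$ by the observation following (38), so Lemma 10 predicts that their bilinear chaos MGF is dominated by the Gaussian one. A Monte Carlo sketch, with $A = \mathrm{id}_n$ chosen for simplicity so that both moment-generating functions stay finite for $|\lambda| < 1$:

    # Continues the sketch above. Lemma 10 with tau(X) = tau(X') = 1.
    n, lam, N = 5, 0.4, 500_000
    A = np.eye(n)
    X  = rng.choice([-1.0, 1.0], (N, n))      # Rademacher entries
    Xp = rng.choice([-1.0, 1.0], (N, n))
    g  = rng.normal(0.0, 1.0, (N, n))
    gp = rng.normal(0.0, 1.0, (N, n))
    lhs = np.mean(np.exp(lam * np.einsum('ij,jk,ik->i', X, A, Xp)))
    rhs = np.mean(np.exp(lam * np.einsum('ij,jk,ik->i', g, A, gp)))
    print(lhs, "<=", rhs)   # expected: True (roughly 1.48 <= 1.55 here)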
Proof of Lemma 2. We can write
\[
X^T A X = \sum_{\substack{k,\ell=1 \\ k \neq \ell}}^n X_k A_{k,\ell} X_\ell, \tag{54}
\]
and since $X$ is centered, $E(X^T A X) = 0$ immediately follows. Let $X'$ be an independent copy of $X$, and let $g$ and $g'$ be independently distributed according to $N(0, \mathrm{id}_n)$. We denote the singular values of $A$ with $s_1, \dots, s_n$. With these definitions, we bound the moment-generating function of $X^T A X$ as
\begin{align}
E\exp\left(\lambda X^T A X\right) &= E\exp\Bigg(\lambda \sum_{\substack{k,\ell=1 \\ k \neq \ell}}^n X_k A_{k,\ell} X_\ell\Bigg) \tag{55} \\
&\leq E\exp\left(4\lambda X^T A X'\right) \tag{56} \\
&\leq E\exp\left(4\lambda \tau(X)^2\, g^T A g'\right) \tag{57} \\
&= E\exp\left(4\lambda \tau(X)^2 \sum_{k=1}^n h_k h'_k s_k\right) \tag{58} \\
&\leq \exp\left(\frac{64\lambda^2 \cdot \tau(X)^4 \sum_{k=1}^n s_k^2}{1 - c}\right), \tag{59}
\end{align}
where (56) is due to the Decoupling Theorem [40, Theorem 6.1.1], (57) is an application of Lemma 10, (58) holds for suitably transformed versions $h, h'$ of $g, g'$ (note that they are still independent and follow the same distribution), and (59) is true if $c \in (0, 1)$ and $|\lambda| < c/(8\tau(X)^2 \max_{1 \leq k \leq n} s_k)$ according to Lemma 8, combined with the bound $\theta(s_k h_k h'_k) \leq 2 s_k$ from Lemma 5. So we can apply Lemma 9 to obtain
\[
P\left(\left|X^T A X\right| \geq \varepsilon\right) \leq 2\exp\left(-\frac{\varepsilon^2 (1 - c)}{256\, \tau(X)^4 \sum_{k=1}^n s_k^2}\right) \tag{60}
\]
in case $\varepsilon \leq \frac{c}{1-c} \cdot \frac{16\tau(X)^2 \sum_{k=1}^n s_k^2}{\max_{1 \leq k \leq n} s_k}$, and
\[
P\left(\left|X^T A X\right| \geq \varepsilon\right) \leq 2\exp\left(-\frac{c \cdot \varepsilon}{16\tau(X)^2 \max_{1 \leq k \leq n} s_k}\right)
\]
otherwise. We next choose $c$ so as to minimize the upper bound on the tail probability. Because the bound in the first case is increasing in $c$ while it is decreasing in the second case, the optimal choice for $c$ is where the two cases meet. We can therefore calculate the optimal $c$ as
\[
c = \frac{\varepsilon \max_{1 \leq k \leq n} s_k}{\varepsilon \max_{1 \leq k \leq n} s_k + 16\tau(X)^2 \sum_{k=1}^n s_k^2}
\]
and, substituting this in (60), we obtain
\[
P\left(\left|X^T A X\right| \geq \varepsilon\right) \leq 2\exp\left(-\frac{\varepsilon^2}{16\varepsilon\tau(X)^2 \max_{1 \leq k \leq n} s_k + 256\,\tau(X)^4 \sum_{k=1}^n s_k^2}\right).
\]
The bounds $\tau(X) \leq K$ and $s_k \leq \|A\|$, together with the identity $\|A\|_F^2 = \sum_{k=1}^n s_k^2$, allow us to conclude the proof of the lemma.

REFERENCES

[1] I. Bjelaković, M. Frey, and S. Stańczak, "Distributed approximation of functions over fast fading channels with applications to distributed learning and the max-consensus problem," in 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Sep. 2019, pp. 1146–1153.
[2] M. I. Jordan and T. M. Mitchell, "Machine learning: Trends, perspectives, and prospects," Science, vol. 349, no. 6245, pp. 255–260, 2015.
[3] R. Kumar, M. Wolenetz, B. Agarwalla, J. Shin, P. Hutto, A. Paul, and U. Ramachandran, "DFuse: A framework for distributed data fusion," in Proceedings of the 1st International Conference on Embedded Networked Sensor Systems, 2003, pp. 114–125.
[4] E. F. Nakamura, A. A. Loureiro, and A. C. Frery, "Information fusion for wireless sensor networks: Methods, models, and classifications," ACM Computing Surveys (CSUR), vol. 39, no. 3, 2007.
[5] T. Cover, A. E. Gamal, and M. Salehi, "Multiple access channels with arbitrarily correlated sources," IEEE Transactions on Information Theory, vol. 26, no. 6, pp. 648–657, 1980.
[6] B. Nazer and M. Gastpar, "Computation over multiple-access channels," IEEE Transactions on Information Theory, vol. 53, no. 10, pp. 3498–3516, 2007.
[7] D. Petrovic, R. C. Shah, K. Ramchandran, and J. Rabaey, "Data funneling: Routing with aggregation and compression for wireless sensor networks," in Proceedings of the First IEEE International Workshop on Sensor Network Protocols and Applications. IEEE, 2003, pp. 156–162.
[8] P. Gupta and P. R. Kumar, "The capacity of wireless networks," IEEE Transactions on Information Theory, vol. 46, no. 2, pp. 388–404, 2000.
[9] A. El Gamal and Y.-H. Kim, Network Information Theory. Cambridge University Press, 2011.
[10] A. N. Kolmogorov, "On the representation of continuous functions of several variables by superposition of continuous functions of one variable and addition," Dokl. Akad. Nauk SSSR, vol. 114, pp. 953–956, 1957.
[11] D. A. Sprecher, "On the structure of continuous functions of several variables," Transactions of the American Mathematical Society, vol. 115, pp. 340–355, 1965.
[12] R. C. Buck, "Approximate complexity and functional representation," Wisconsin University Madison Mathematics Research Center, Tech. Rep., 1976.
[13] ——, "Nomographic functions are nowhere dense," Proceedings of the American Mathematical Society, pp. 195–199, 1982.
[14] M. Gastpar and M. Vetterli, "Source-channel communication in sensor networks," in Information Processing in Sensor Networks. Springer, 2003, pp. 162–177.
[15] M. Goldenbaum and S. Stańczak, "Robust analog function computation via wireless multiple-access channels," IEEE Transactions on Communications, vol. 61, no. 9, pp. 3863–3877, 2013.
[16] M. Goldenbaum, H. Boche, and S. Stańczak, "Harnessing interference for analog function computation in wireless sensor networks," IEEE Transactions on Signal Processing, vol. 61, no. 20, pp. 4893–4906, 2013.
[17] ——, "Nomographic functions: Efficient computation in clustered Gaussian sensor networks," IEEE Transactions on Wireless Communications, vol. 14, no. 4, pp. 2093–2105, 2014.
[18] J. Zhan, B. Nazer, M. Gastpar, and U. Erez, "MIMO compute-and-forward," in 2009 IEEE International Symposium on Information Theory (ISIT). IEEE, 2009, pp. 2848–2852.
[19] O. Ordentlich, J. Zhan, U. Erez, M. Gastpar, and B. Nazer, "Practical code design for compute-and-forward," in 2011 IEEE International Symposium on Information Theory (ISIT). IEEE, 2011, pp. 1876–1880.
[20] B. Nazer and M. Gastpar, "Compute-and-forward: Harnessing interference through structured codes," IEEE Transactions on Information Theory, vol. 57, no. 10, pp. 6463–6486, 2011.
[21] B. Nazer, V. R. Cadambe, V. Ntranos, and G. Caire, "Expanding the compute-and-forward framework: Unequal powers, signal levels, and multiple linear combinations," IEEE Transactions on Information Theory, vol. 62, no. 9, pp. 4879–4909, 2016.
[22] M. Goldenbaum, P. Jung, M. Raceala-Motoc, J. Schreck, S. Stańczak, and C. Zhou, "Harnessing channel collisions for efficient massive access in 5G networks: A step forward to practical implementation." IEEE, 2016, pp. 335–339.
[23] K. Ralinovski, M. Goldenbaum, and S. Stańczak, "Energy-efficient classification for anomaly detection: The wireless channel as a helper," 2016, pp. 1–6.
[24] W. Liu, X. Zang, Y. Li, and B. Vucetic, "Over-the-air computation systems: Optimization, analysis and scaling laws," IEEE Transactions on Wireless Communications, 2020.
[25] J. Dong, Y. Shi, and Z. Ding, "Blind over-the-air computation and data fusion via provable Wirtinger flow," IEEE Transactions on Signal Processing, vol. 68, pp. 1136–1151, 2020.
[26] M. M. Amiri and D. Gündüz, "Machine learning at the wireless edge: Distributed stochastic gradient descent over-the-air," IEEE Transactions on Signal Processing, vol. 68, pp. 2155–2169, 2020.
[27] J.-H. Ahn, O. Simeone, and J. Kang, "Wireless federated distillation for distributed edge learning with heterogeneous data," in 2019 IEEE 30th Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC). IEEE, 2019, pp. 1–6.
[28] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, "Communication-efficient learning of deep networks from decentralized data," in Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017, pp. 1273–1282.
[29] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, "Federated learning: Strategies for improving communication efficiency," arXiv preprint arXiv:1610.05492, 2016.
[30] G. Zhu, Y. Wang, and K. Huang, "Broadband analog aggregation for low-latency federated edge learning," IEEE Transactions on Wireless Communications, 2019.
[31] M. M. Amiri and D. Gündüz, "Federated learning over wireless fading channels," IEEE Transactions on Wireless Communications, 2020.
[32] K. Yang, T. Jiang, Y. Shi, and Z. Ding, "Federated learning via over-the-air computation," IEEE Transactions on Wireless Communications, vol. 19, no. 3, pp. 2022–2035, 2020.
[33] D. Liu, G. Zhu, J. Zhang, and K. Huang, "Data-importance aware user scheduling for communication-efficient edge machine learning," arXiv preprint arXiv:1910.02214, 2019.
[34] T. Sery and K. Cohen, "On analog gradient descent learning over multiple access fading channels," IEEE Transactions on Signal Processing, vol. 68, pp. 2897–2911, 2020.
[35] Y. Sun, S. Zhou, and D. Gündüz, "Energy-aware analog aggregation for federated learning with redundant data," arXiv preprint arXiv:1911.00188, 2019.
[36] M. Seif, R. Tandon, and M. Li, "Wireless federated learning with local differential privacy," arXiv preprint arXiv:2002.05151, 2020.
[37] M. M. Amiri, T. M. Duman, and D. Gündüz, "Collaborative machine learning at the wireless edge with blind transmitters," arXiv preprint arXiv:1907.03909, 2019.
[38] V. Buldygin and Y. Kozachenko, Metric Characterization of Random Variables and Random Processes. American Mathematical Society, 2000. [Online]. Available: https://books.google.de/books?id=ePDXvIhdEjoC
[39] M. J. Wainwright, High-Dimensional Statistics: A Non-Asymptotic Viewpoint, ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
[40] R. Vershynin, High-Dimensional Probability: An Introduction with Applications in Data Science, ser. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018, vol. 47.
[41] I. Steinwart and A. Christmann, Support Vector Machines, ser. Information Science and Statistics. Springer, 2008.
[42] A. Christmann and R. Hable, "Consistency of support vector machines using additive kernels for additive models," Computational Statistics & Data Analysis, vol. 56, no. 4, pp. 854–873, 2012.
[43] M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning, ser. Adaptive Computation and Machine Learning. MIT Press, 2012.
[44] N. Agrawal, M. Frey, and S. Stańczak, "A scalable max-consensus protocol for noisy ultra-dense networks," in 2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless Communications (SPAWC), July 2019, pp. 1–5.
[45] R. Bhatia, Matrix Analysis. Springer Science & Business Media, 1997, vol. 169.
[46] S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.
[47] R. Bhatia, M.-D. Choi, and C. Davis, "Comparing a matrix to its off-diagonal part," in