Improved Communication Efficiency for Distributed Mean Estimation with Side Information
Kai Liang∗ and Youlong Wu∗
∗ShanghaiTech University, Shanghai, China
†Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences
‡University of Chinese Academy of Sciences, Beijing, China
{liangkai, wuyl1}@shanghaitech.edu.cn

Abstract—In this paper, we consider the distributed mean estimation problem where the server has access to some side information, e.g., its locally computed mean estimate or the information received from the distributed clients at previous iterations. We propose a practical and efficient estimator based on the r-bit Wyner-Ziv estimator proposed by Mayekar et al., which requires no probabilistic assumption on the data. Unlike Mayekar's work, which only utilizes side information at the server, our scheme jointly exploits the correlation between the clients' data and the server's side information, as well as the correlation between the data of different clients. We derive an upper bound on the estimation error of the proposed estimator. Based on this upper bound, we provide two algorithms for choosing the input parameters of the estimator. Finally, we characterize the parameter regions in which our estimator is better than the previous one.

Index Terms—distributed mean estimation, side information, distributed lossy compression
I. INTRODUCTION
With the development of modern machine learning technology, more powerful and complex machine learning models can be trained through large-scale distributed training. However, due to the large scale of the model parameters, the exchange of information between distributed nodes in each iteration of distributed optimization incurs a huge communication load, causing a communication bottleneck.

We focus on distributed mean estimation, which is a crucial primitive for distributed optimization frameworks. Federated learning [1] is one such framework, in which clients participating in joint training only need to exchange their own gradient information without sharing private data. To alleviate the communication bottleneck, gradient compression [2]–[8] and efficient mean estimators [9]–[15] have been investigated to reduce the communication load. Recently, [16] studied distributed mean estimation with side information at the server, and proposed Wyner-Ziv estimators that require no probabilistic assumption on the clients' data.

In parallel, distributed source compression has been widely studied in classical information theory. For example, [17] first studied the setting of lossy source compression with side information at the decoder. Channel coding can yield practical codes for distributed source coding [18], [19], but the main bottleneck lies in the expensive computational complexity of encoding and decoding.
This work is supported by NSFC grant NSF61901267.
In this paper, we study practical schemes for distributed mean estimation with side information at the server. The motivation is based on the fact that the server may store publicly accessible data, and at each iteration the server has already received the data sent by clients at previous iterations, which can be viewed as side information. Rather than using random coding with joint typicality tools as in [17], [20], which is impractical to implement, we follow the work in [16], which proposed a Wyner-Ziv estimator based on coset coding. Unfortunately, that scheme only utilizes the side information at the server, and fails to exploit the correlation between clients' vectors. In fact, in many scenarios such as stochastic gradient descent, the data of different clients may be highly correlated since they wish to learn a global model. Inspired by Wyner-Ziv and Slepian-Wolf coding, we propose a practical scheme based on coset coding that jointly exploits the side information at the server and the correlation between clients' data. Note that our scheme must address an ambiguity problem not present in [16] or in classic Wyner-Ziv coding. In more detail, since each client compresses its data before sending it to the server, the server only observes a lossy version of the clients' vectors. This ambiguity causes mismatched information at the clients and the server, and using the lossy version of the clients' data as side information may even deteriorate the estimation.

We summarize our contributions as follows: 1) we propose a new estimator that improves the estimator in [16] by jointly exploiting the side information at the server and the correlation between clients' data; 2) we derive an upper bound on the estimation error of the proposed estimator; 3) we provide two greedy algorithms for choosing the input parameters of the estimator, and characterize the parameter regions in which our estimator has a tighter upper bound than that of the previous estimator.

II. PROBLEM SETTING
Consider the problem of distributed mean estimation with side information, as depicted in Fig. 1. The model consists of $n$ clients and one server, where each Client $i \in [n] \triangleq \{1, \ldots, n\}$ observes data $x_i \in \mathcal{X} \subset \mathbb{R}^d$ and the server has access to side information $y = (y_1, \ldots, y_n)$, $y_i \in \mathcal{Y} \subset \mathbb{R}^d$, for some alphabets $\mathcal{X}, \mathcal{Y}$ and positive integer $d \in \mathbb{N}$.

Fig. 1. Distributed mean estimation with side information: Clients $1, \ldots, n$ hold $x_1, \ldots, x_n$; the server holds $(y_1, \ldots, y_n)$.

The server wishes to compute the empirical mean, i.e.,
$$\bar{x} \triangleq \frac{1}{n}\sum_{i=1}^{n} x_i. \qquad (1)$$
Note that the side information $y$ could stem from some publicly accessible data or the server's guess of $x \triangleq (x_1, \ldots, x_n)$ from previous iterations.

We focus on non-interactive protocols and study the $r$-bit simultaneous message passing (SMP) protocol similar to that in [16]. The $r$-bit SMP protocol $\pi = (\pi_1, \ldots, \pi_n)$ consists of $n$ encoders $\{\Psi_i\}_{i=1}^{n}$ and one decoder $\Phi$, of mapping forms
$$\Psi_i : \mathcal{X} \to \{0,1\}^r, \qquad (2)$$
$$\Phi : \underbrace{\{0,1\}^r \times \cdots \times \{0,1\}^r}_{n \text{ times}} \times\, \mathcal{Y}^n \to \mathbb{R}^d. \qquad (3)$$
Each Client $i \in [n]$ uses the encoder $\Psi_i$ to encode $x_i$ into an $r$-bit message $m_i = \Psi_i(x_i, U)$, where $U$ denotes shared randomness known to the server and all clients. Client $i$ then sends the message $m_i$ to the server, and we assume $m_i$ is received perfectly. After receiving all messages $m^{(n)} = (m_1, \ldots, m_n)$, the server uses the decoder $\Phi$ to produce
$$\hat{\bar{x}} = \Phi(m^{(n)}, y, U). \qquad (4)$$
The performance of the $r$-bit SMP protocol $\pi$ with inputs $x$ and $y$ is evaluated by the mean squared error (MSE), i.e.,
$$\mathcal{E}_{\pi}(x, y) \triangleq \mathbb{E}\big[\|\hat{\bar{x}} - \bar{x}\|^2\big]. \qquad (5)$$
Instead of making any probabilistic assumption on the input data and side information, we use the Euclidean distance between vectors to measure the correlation among the data and side information. More specifically, let the distance between $x_i$ and $y_i$ be at most $\Delta_i$ and the distance between $x_i$ and $x_j$ be at most $\Delta_{ij}$, i.e.,
$$\|x_i - y_i\| \le \Delta_i, \quad \forall i \in [n], \qquad (6a)$$
$$\|x_i - x_j\| \le \Delta_{ij}, \quad \forall i, j \in [n]. \qquad (6b)$$
Since the distance is symmetric, $\Delta_{ij} = \Delta_{ji}$, and it suffices to consider only $\Delta_{ij}$ with $i < j$. Let
$$\Delta_s \triangleq (\Delta_1, \cdots, \Delta_n), \qquad \Delta_c \triangleq (\Delta_{12}, \cdots, \Delta_{(n-1)n}). \qquad (7)$$
We are interested in the performance of protocols when $\Delta_s$ and $\Delta_c$ are both known to the clients and the server. Define the optimal $r$-bit protocol with the minimum MSE as $\pi^*$, and the corresponding MSE as $\mathcal{E}_{\pi^*}(x, y)$. Our goal is to find practical and efficient $r$-bit SMP protocols, and to derive tighter upper bounds on $\mathcal{E}_{\pi^*}$ than the previous results.
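For concreteness, the protocol skeleton can be written in a few lines of Python. The names `encode` and `decode` are our own placeholders for $\Psi_i$ and $\Phi$, and the shared randomness $U$ is modeled as a common seed (an assumption for illustration):

```python
import numpy as np

def smp_round(xs, ys, encode, decode, shared_seed):
    """One round of the r-bit SMP protocol (cf. (2)-(4)).

    xs: list of client vectors x_1..x_n;  ys: server side information y_1..y_n.
    encode: stand-in for Psi_i, maps (x_i, rng) to an r-bit message.
    decode: stand-in for Phi, maps (messages, ys, rng) to the mean estimate.
    The shared randomness U is modeled by reseeding identical generators.
    """
    messages = [encode(x, np.random.default_rng(shared_seed)) for x in xs]
    return decode(messages, ys, np.random.default_rng(shared_seed))
```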
III. PREVIOUS WORK

In [16], the authors proposed an SMP protocol based on an $r$-bit Wyner-Ziv quantizer $Q_{\mathrm{WZ}}$. The quantizer $Q_{\mathrm{WZ}}$ contains an encoder mapping $Q^{\mathrm{e}}_{\mathrm{WZ}}$ of the same form as (2) and a simplified decoder mapping $Q^{\mathrm{d}}_{\mathrm{WZ}} : \{0,1\}^r \times \mathcal{Y} \to \mathbb{R}^d$. Each Client $i \in [n]$ first uses the encoder $Q^{\mathrm{e}}_{\mathrm{WZ}}$ to encode $x_i$ and then sends the encoded message $m_i$ to the server. The server uses the decoder $Q^{\mathrm{d}}_{\mathrm{WZ}}$ to produce the estimate
$$\hat{x}_i = Q^{\mathrm{d}}_{\mathrm{WZ}}(m_i, y_i), \qquad (8)$$
and then computes the sample mean
$$\hat{\bar{x}} = \frac{1}{n}\sum_{i=1}^{n} \hat{x}_i. \qquad (9)$$
The quantizer $Q_{\mathrm{WZ}}$ achieves the following upper bound on the MSE.

Theorem 1 (Upper bound given in [16]). For a fixed $\Delta_s$ and $r \le d$, the optimal $r$-bit protocol $\pi^*$ satisfies
$$\mathcal{E}_{\pi^*}(x, y) \le \big(79\lceil\log(2 + \sqrt{12\ln n})\rceil + 26\big)\sum_{i=1}^{n} \frac{\Delta_i^2 d}{n^2 r}, \qquad (10)$$
for all $x$ and $y$ satisfying (6).

Now we introduce the quantizer $Q_{\mathrm{WZ}}$, as it is closely related to our work. Since all clients use the same quantizer, only the common quantizer is described. We first describe a modulo quantizer $Q_M$ for a one-dimensional input $x \in \mathbb{R}$ with side information $h \in \mathbb{R}$, then present a rotated modulo quantizer $Q_{M,R}(x, h)$ for $d$-dimensional data, and finally give the $r$-bit Wyner-Ziv quantizer based on $Q_M$ and $Q_{M,R}$.
1) Modulo Quantizer ($Q_M$): Given the input $x \in \mathbb{R}$ with side information $h \in \mathbb{R}$, the modulo quantizer $Q_M$ has three parameters: a distance parameter $\Delta'$ with $|x - h| \le \Delta'$, a resolution parameter $k \in \mathbb{N}^+$, and a lattice parameter $\epsilon$. Denote the encoder and decoder of $Q_M$ by $Q^{\mathrm{e}}_M(x)$ and $Q^{\mathrm{d}}_M(Q^{\mathrm{e}}_M(x), h)$, respectively. The encoder $Q^{\mathrm{e}}_M(x)$ first computes $\lceil x/\epsilon \rceil$ and $\lfloor x/\epsilon \rfloor$, and then outputs the message $Q^{\mathrm{e}}_M(x) = m$, where
$$m = \begin{cases} \lceil x/\epsilon \rceil \bmod k, & \text{w.p. } x/\epsilon - \lfloor x/\epsilon \rfloor, \\ \lfloor x/\epsilon \rfloor \bmod k, & \text{w.p. } \lceil x/\epsilon \rceil - x/\epsilon. \end{cases} \qquad (11)$$
The message $m$ has a length of $\log k$ bits and is sent to the decoder. The decoder $Q^{\mathrm{d}}_M$ produces the estimate $\hat{x}$ by finding the point closest to $h$ in the set $\mathbb{Z}_{m,\epsilon} = \{(zk + m)\cdot\epsilon : z \in \mathbb{Z}\}$.
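To make the construction concrete, here is a minimal Python sketch of $Q_M$: the encoder performs the stochastic rounding of (11), and the decoder searches the coset $\mathbb{Z}_{m,\epsilon}$ for the point nearest to $h$. The function names are our own illustration, not code from [16].

```python
import math
import random

def qm_encode(x, eps, k, rng=random):
    """Modulo quantizer encoder: stochastic rounding of x/eps, sent mod k."""
    lo, hi = math.floor(x / eps), math.ceil(x / eps)
    # Round up with probability x/eps - lo, otherwise round down (cf. (11)).
    z = hi if rng.random() < (x / eps - lo) else lo
    return z % k  # a message of log2(k) bits

def qm_decode(m, h, eps, k):
    """Modulo quantizer decoder: closest point to the side information h
    in the coset Z_{m,eps} = {(z*k + m) * eps : z integer}."""
    z = round((h / eps - m) / k)  # coset index whose point is nearest to h
    candidates = [((z + t) * k + m) * eps for t in (-1, 0, 1)]
    return min(candidates, key=lambda p: abs(p - h))
```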
2) Rotated Modulo Quantizer ($Q_{M,R}$): Given the input $x \in \mathbb{R}^d$ with side information $h \in \mathbb{R}^d$, where $\|x - h\| \le \Delta$, the input parameters for $Q_{M,R}$ include a distance parameter $\Delta'$, a resolution parameter $k \in \mathbb{N}^+$, a lattice parameter $\epsilon$, and a rotation matrix $R$ given by
$$R = \frac{1}{\sqrt{d}}\, W D, \qquad (12)$$
where $W$ is the $d \times d$ Walsh-Hadamard matrix [21] and $D$ is a diagonal matrix with each diagonal entry generated uniformly from $\{+1, -1\}$ using the shared randomness. After the rotation, every coordinate $i \in [d]$ of $R(x - h)$, denoted by $R(x-h)(i)$, is zero-mean sub-Gaussian with variance factor $\Delta^2/d$, i.e.,
$$\mathbb{P}\big(|R(x-h)(i)| \ge \Delta'\big) \le 2e^{-\Delta'^2 d/(2\Delta^2)}. \qquad (13)$$
The quantizer $Q_{M,R}$ first preprocesses $x$ and $h$ by multiplying both with the matrix $R$, and then applies $Q_M$ to each coordinate. Denote the encoder and decoder of $Q_{M,R}$ by $Q^{\mathrm{e}}_{M,R}(x)$ and $Q^{\mathrm{d}}_{M,R}(Q^{\mathrm{e}}_{M,R}(x), h)$, respectively.
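The rotation (12) can be applied in $O(d\log d)$ time with the fast Walsh-Hadamard transform rather than an explicit matrix product. A minimal numpy sketch, assuming $d$ is a power of two and modeling the shared randomness as a common seed:

```python
import numpy as np

def fwht(v):
    """Fast Walsh-Hadamard transform of v (len(v) must be a power of 2)."""
    a = v.copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), 2 * h):
            x, y = a[i:i + h].copy(), a[i + h:i + 2 * h].copy()
            a[i:i + h], a[i + h:i + 2 * h] = x + y, x - y
        h *= 2
    return a

def rotate(v, signs):
    """Apply R = W D / sqrt(d): random sign flips, Hadamard transform, scaling."""
    return fwht(v * signs) / np.sqrt(len(v))

rng = np.random.default_rng(seed=0)          # seed models the shared randomness
d = 8
signs = rng.choice([-1.0, 1.0], size=d)      # diagonal of D
x = rng.standard_normal(d)
h = x + 0.1 * rng.standard_normal(d)
rx, rh = rotate(x, signs), rotate(h, signs)  # quantize rx against rh with Q_M
```

Since $R$ is unitary, the server can invert the rotation by applying another Hadamard transform followed by the same sign flips and scaling.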
3) The $r$-bit Wyner-Ziv Quantizer ($Q_{\mathrm{WZ}}$): Note that in the quantizer $Q_{M,R}$ the input $x$ is encoded into $d$ binary strings of $\log k$ bits each, for a total of $d\log k$ bits. In the $r$-bit Wyner-Ziv quantizer, the encoder first encodes $x$ using the same encoder as $Q^{\mathrm{e}}_{M,R}(x)$, then uses the shared randomness to select a subset $S \subset \{1, \ldots, d\}$ of these strings with $|S| = \lfloor r/\log k \rfloor$, and finally sends them to the decoder. The decoder uses the same decoder as $Q^{\mathrm{d}}_{M,R}$ to decode the entries in $S$. Denote the encoder and decoder of $Q_{\mathrm{WZ}}$ by $Q^{\mathrm{e}}_{\mathrm{WZ}}(x)$ and $Q^{\mathrm{d}}_{\mathrm{WZ}}(Q^{\mathrm{e}}_{\mathrm{WZ}}(x), h)$, respectively.
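Because $S$ is drawn from the shared randomness, the encoder and decoder agree on it without any communication. A short sketch of this step, again modeling the shared randomness as a common seed (an assumption for illustration):

```python
import numpy as np

def sample_coords(d, r, k, shared_seed):
    """Both encoder and decoder draw the same subset S of size
    floor(r / log2(k)) of the d coordinates from the shared seed."""
    rng = np.random.default_rng(shared_seed)
    size = int(r // np.log2(k))
    return rng.choice(d, size=size, replace=False)

# Encoder: quantize all d rotated coordinates with Q_M, transmit only those in S.
# Decoder: decode the coordinates in S; outside S it falls back on the side
# information (cf. (16b) in Section IV).
```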
IV. NEW PROTOCOL AND NEW UPPER BOUND

A. New Protocols
Note that in (8) only $y_i$ is used as side information to assist the estimation of $x_i$ at the server. In fact, apart from $y_i$, the side information $\{y_j\}_{j \ne i}$ and the other clients' data $\{x_j\}_{j \ne i}$ could also be correlated with $x_i$, and thus can be jointly utilized to reduce the transmission load. The main challenge is that $\{x_j\}_{j \ne i}$ cannot be perfectly known by the server, so using the estimates $\{\hat{x}_j\}_{j \ne i}$ as side information for $x_i$ may even deteriorate the estimation.

Our protocol is based on a set of new $r$-bit quantizers, denoted by $\{Q^{\mathcal{L}_{\Pi_i}}_{\mathrm{Pro}}\}_{i=1}^{n}$, where $\Pi_i$ denotes the $i$-th element of a permutation $\Pi$ of $[n]$, and $\mathcal{L}_{\Pi_i}$ is a chain parameter that needs to be designed and has the form $\mathcal{L}_{\Pi_i} : y_{\Pi_{i_1}} \to x_{\Pi_{i_1}} \to x_{\Pi_{i_2}} \to \cdots \to x_{\Pi_{i_l}}$, with $x_{\Pi_{i_l}} = x_{\Pi_i}$ and $l$ being the length of the chain.

Given a set of chains $\{\mathcal{L}_{\Pi_i} : i \in [n]\}$, the input data $\{x_i\}_{i=1}^{n}$ are estimated in the order $x_{\Pi_1}, \ldots, x_{\Pi_n}$. For the input $x_{\Pi_i}$, the corresponding quantizer $Q^{\mathcal{L}_{\Pi_i}}_{\mathrm{Pro}}$ consists of an encoder identical to $Q^{\mathrm{e}}_{\mathrm{WZ}}(x_{\Pi_i})$ and a novel decoder $Q^{\mathrm{d},\mathcal{L}_{\Pi_i}}_{\mathrm{Pro}}$ of mapping form
$$Q^{\mathrm{d},\mathcal{L}_{\Pi_i}}_{\mathrm{Pro}} : \underbrace{\{0,1\}^r \times \cdots \times \{0,1\}^r}_{i \text{ times}} \times\, \mathcal{Y}^i \to \mathbb{R}^d,$$
which is used to decode $x_{\Pi_i}$ as
$$\hat{x}_{Q^{\mathcal{L}_{\Pi_i}}_{\mathrm{Pro}}, \Pi_i} = Q^{\mathrm{d},\mathcal{L}_{\Pi_i}}_{\mathrm{Pro}}\big(Q^{\mathrm{e}}_{\mathrm{WZ}}(x_{\Pi_1}), \ldots, Q^{\mathrm{e}}_{\mathrm{WZ}}(x_{\Pi_i}), y_{\Pi_1}, \ldots, y_{\Pi_i}\big).$$
Given any quantizer $Q$, denote its estimate for input $x_i$ by $\hat{x}_{Q,i}$. Here $\hat{x}_{Q^{\mathcal{L}_{\Pi_i}}_{\mathrm{Pro}}, \Pi_i}$ denotes the estimate of $x_{\Pi_i}$ when using the quantizer $Q^{\mathcal{L}_{\Pi_i}}_{\mathrm{Pro}}$ for the given chain $\mathcal{L}_{\Pi_i}$. With a slight abuse of notation, we write $\hat{x}_{Q^{\mathcal{L}_{\Pi_i}}_{\mathrm{Pro}}, \Pi_i}$ as $\hat{x}_{Q_{\mathrm{Pro}}, \Pi_i}$.

In the following, we describe the quantizers $\{Q^{\mathcal{L}_{\Pi_i}}_{\mathrm{Pro}}\}_{i \in [n]}$ in two steps: 1) given a set of chains $\{\mathcal{L}_{\Pi_i}\}_{i \in [n]}$, how to compute the estimates $\hat{x}_{Q_{\mathrm{Pro}}, \Pi_i}$ for $i \in [n]$; and 2) how to select proper chains $\{\mathcal{L}_{\Pi_i}\}_{i \in [n]}$ to reduce the MSE.
1) New quantizer for given chains $\{\mathcal{L}_i\}_{i \in [n]}$: Without loss of generality, we assume that the estimation order $\Pi$ is the identity permutation, i.e., $\Pi_i = i$ for $i \in [n]$. With this assumption, the chain $\mathcal{L}_{\Pi_i}$ can be written as
$$\mathcal{L}_i : y_{i_1} \to x_{i_1} \to x_{i_2} \to \cdots \to x_{i_l}, \qquad (14)$$
where $x_{i_l} = x_i$, $i_t \in [i-1]$ for all $t = 1, \ldots, l-1$, and decoder $i$ already has the $i-1$ estimates $\hat{x}_{Q_{\mathrm{Pro}},1}, \ldots, \hat{x}_{Q_{\mathrm{Pro}},i-1}$.

The encoder is the same as $Q^{\mathrm{e}}_{\mathrm{WZ}}$: Client $i$ first applies the encoder $Q^{\mathrm{e}}_{M,R}(x_i)$ to encode $x_i$, then uses the shared randomness to select a subset $S \subset \{1, \ldots, d\}$ of these strings with $|S| = \lfloor r/\log k \rfloor$, and finally sends them to the decoder. The decoder $Q^{\mathrm{d}}_{\mathrm{Pro},i}$ chooses an element of $\mathcal{M}_i$ as the "side" information $h$ for $Q^{\mathrm{d}}_{M,R}$, where
$$\mathcal{M}_i = \begin{cases} \{y_i, \hat{x}_{Q_{\mathrm{Pro}},1}, \cdots, \hat{x}_{Q_{\mathrm{Pro}},i-1}\}, & \text{if } i > 1, \\ \{y_i\}, & \text{if } i = 1. \end{cases} \qquad (15)$$
We emphasize that here the "side" information $h$ could be the estimate of another client's data, rather than the literal side information $y_i$ used in the Wyner-Ziv quantizer $Q_{\mathrm{WZ}}$.

Given the chain $\mathcal{L}_i$ in (14), denote by $\Delta'_{i_1}$ and $\Delta'_{i_s i_{s+1}}$ the weight parameters of the subchains $y_{i_1} \to x_{i_1}$ and $x_{i_s} \to x_{i_{s+1}}$, respectively. The choices of $\Delta'_{i_1}$ and $\Delta'_{i_s i_{s+1}}$ are based on (13), and follow a way similar to that in the quantizer $Q_{M,R}$. For $t \in [l]$, let $w_{i_t} \triangleq \Delta'_{i_1} + \sum_{s=1}^{t-1}\Delta'_{i_s i_{s+1}}$.

Given a vector $v \in \mathbb{R}^d$ and a subset $S \subseteq [d]$, let $v(S) \triangleq (v_i : i \in S)$. The decoder computes $\hat{x}_{Q_{\mathrm{Pro}},i_t}(S)$ as the outputs on the coordinates in $S$ of the decoder $Q^{\mathrm{d}}_{M,R}(Q^{\mathrm{e}}_{M,R}(x_{i_t}), \hat{x}_{Q_{\mathrm{Pro}},i_{t-1}})$ with parameters $\Delta' = w_{i_t}$ and $h = \hat{x}_{Q_{\mathrm{Pro}},i_{t-1}}$, and sets the values of $\hat{x}_{Q_{\mathrm{Pro}},i_t}([d]\setminus S)$ to those of $y_{i_t}$, i.e.,
$$\hat{x}_{Q_{\mathrm{Pro}},i_t}(S) = Q^{\mathrm{d}}_{M,R}\big(Q^{\mathrm{e}}_{M,R}(x_{i_t}), \hat{x}_{Q_{\mathrm{Pro}},i_{t-1}}\big)(S), \qquad (16a)$$
$$\hat{x}_{Q_{\mathrm{Pro}},i_t}([d]\setminus S) = y_{i_t}([d]\setminus S). \qquad (16b)$$
By recursively applying (16), we obtain the estimate $\hat{x}_{Q_{\mathrm{Pro}},i} = \hat{x}_{Q_{\mathrm{Pro}},i_l}$ for the input $x_i$.

In our protocol, the parameters $h$ and $\Delta'$ depend on the design of the chains $\{\mathcal{L}_i\}_{i=1}^{n}$, $S$ is generated by the shared randomness, and $k, \epsilon$ can be freely assigned. When the length of the chain $\mathcal{L}_i$ is $l = 1$, the quantizer $Q^{\mathcal{L}_i}_{\mathrm{Pro}}$ reduces to the Wyner-Ziv quantizer $Q_{\mathrm{WZ}}$ if the chosen chain is $\mathcal{L}_i := y_i \to x_i$.
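The recursion in (16) can be sketched compactly. Below, `decode_qmr` is a hypothetical stand-in for $Q^{\mathrm{d}}_{M,R}$ that already performs the split between the transmitted coordinates $S$ and the fallback coordinates $[d]\setminus S$ of (16a)-(16b):

```python
def decode_chain(chain, messages, y, weights, decode_qmr):
    """Decode along a chain y_{i1} -> x_{i1} -> ... -> x_{il} (cf. (14), (16)).

    chain:      list of client indices [i1, ..., il]
    messages:   dict client -> received message for the coordinates in S
    y:          dict client -> side-information vector
    weights:    dict client -> accumulated weight w_{i_t}, used as Delta'
    decode_qmr: callable (message, h, delta) -> estimate of x_client
    """
    h = y[chain[0]]  # the chain starts from the literal side information
    for client in chain:
        h = decode_qmr(messages[client], h, weights[client])
    return h         # estimate of x_{il}
```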
2) Selection of Chains: Next we give two algorithms for choosing the chains $\{\mathcal{L}_i : i \in [n]\}$.

Algorithm 1: Given weight parameters $\{\Delta'_i : i \in [n]\}$ and $\{\Delta'_{ij} : i, j \in [n], i < j\}$, for Client 1 we use the chain $\mathcal{L}_1 = y_1 \to x_1$ with weight $\Delta'_1$. For Client $i$, suppose that the chains $\{\mathcal{L}_j : j \in [i-1]\}$ with weights $\{w_j : j \in [i-1]\}$ are already known. Construct $i$ candidate chains $\{\mathcal{L}_i(j) : j \le i\}$ as follows:
$$\mathcal{L}_i(j) = \mathcal{L}_j \to x_i, \text{ if } j \in [i-1]; \qquad \mathcal{L}_i(j) = y_i \to x_i, \text{ if } j = i. \qquad (17)$$
Then, compute the weights $\{w_i(j) : j \in [i]\}$ according to
$$w_i(j) = w_j + \Delta'_{ji}, \text{ if } j \in [i-1]; \qquad w_i(j) = \Delta'_i, \text{ if } j = i. \qquad (18)$$
For each Client $i$, $\mathcal{L}_i$ can be chosen from the $i$ candidate chains in $\{\mathcal{L}_i(j) : j \in [i]\}$. We choose $\mathcal{L}_i = \mathcal{L}_i(j^*)$ with $j^* \triangleq \arg\min_{j \in [i]} w_i(j)$. We formally describe the method in Algorithm 1; a Python sketch follows the pseudocode.

Algorithm 1 Selection of Chains
Input: $\{\Delta'_i : i \in [n]\}$ and $\{\Delta'_{ij} : i, j \in [n], i < j\}$
Output: $Chains$
  $Chains \leftarrow \emptyset$, $Weights \leftarrow \emptyset$
  $Chains \leftarrow Chains \cup \{\mathcal{L}_1 = y_1 \to x_1\}$
  $Weights \leftarrow Weights \cup \{w_1 = \Delta'_1\}$
  for $2 \le i \le n$ do
    Generate $\{\mathcal{L}_i(j) : j \le i\}$ as in (17)
    Compute $\{w_i(j) : j \le i\}$ as in (18)
    Compute $j^* = \arg\min_{j \in [i]} w_i(j)$
    $\mathcal{L}_i \leftarrow \mathcal{L}_i(j^*)$, $w_i \leftarrow w_i(j^*)$
    $Chains \leftarrow Chains \cup \{\mathcal{L}_i\}$, $Weights \leftarrow Weights \cup \{w_i\}$
  return $Chains$
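A direct Python transcription of Algorithm 1 (clients are 0-indexed; the data structures are our own choice):

```python
def select_chains(delta_s, delta_c):
    """Greedy chain selection (Algorithm 1), 0-indexed.

    delta_s: list, delta_s[i] = Delta'_i (weight of the subchain y_i -> x_i)
    delta_c: dict, delta_c[(j, i)] = Delta'_ji for j < i
    Returns one chain per client as a list of indices: [j, ..., i] encodes
    the chain y_j -> x_j -> ... -> x_i.
    """
    chains, weights = [[0]], [delta_s[0]]     # Client 0 uses y_0 -> x_0
    for i in range(1, len(delta_s)):
        best_w, best_chain = delta_s[i], [i]  # candidate j = i: fresh chain
        for j in range(i):                    # candidates j < i: extend L_j
            w = weights[j] + delta_c[(j, i)]
            if w < best_w:
                best_w, best_chain = w, chains[j] + [i]
        chains.append(best_chain)
        weights.append(best_w)
    return chains
```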
Algorithm 2: Note that Algorithm 1 is simple and fast, but may not find chains that improve the MSE bound in (10). Therefore, we are interested in finding good chains and the corresponding region of $(\Delta_c, \Delta_s)$ for which the upper bound on the MSE is smaller than (10). We illustrate our idea in the special case where the length of each chain is at most 2. For Client $i$, consider the chains $y_t \to x_t \to x_i$ and $y_i \to x_i$ with $t < i$. By Remark 3 (described later in Section IV-B), we select the chain as follows: if $(\Delta_t, \Delta_{ti}, \Delta_i)$ is in the region $\mathcal{R}$ defined in (23), we use the chain $\mathcal{L}_i : y_t \to x_t \to x_i$; otherwise we use the chain $y_i \to x_i$. Now we look for good chains according to $\mathcal{R}$. Without loss of generality, let $\Delta_1 \le \cdots \le \Delta_n$. Starting from Client 1, first generate a chain of length 1 for Client 1, and then traverse the remaining clients to verify whether the corresponding distances are in $\mathcal{R}$; if they are, construct a chain of length 2. For the remaining clients whose chains are empty, renumber them and repeat the above process until every client has a nonempty chain. We formally describe the method in Algorithm 2.

B. New Upper Bound of MSE
Define the following quantities:
$$\alpha_i(Q) \triangleq \sup_{x, y \text{ satisfying (6)}} \mathbb{E}\big[\|\hat{x}_{Q,i} - x_i\|^2\big], \qquad (19)$$
$$\beta_i(Q) \triangleq \sup_{x, y \text{ satisfying (6)}} \big\|\mathbb{E}[\hat{x}_{Q,i} - x_i]\big\|^2. \qquad (20)$$
Given a chain $\mathcal{L}_i$ and a specific parameter assignment, the following lemma gives a recursive inequality for the upper bound on the error when using the quantizer $Q^{\mathcal{L}_i}_{\mathrm{Pro}}$.
Algorithm 2 Selection of Chains for the Special Case
Input: $(\Delta_s, \Delta_c)$, $\mathcal{R}$
Output: $Chains$
  $Chains \leftarrow \emptyset$
  $C \leftarrow \mathrm{List}[1, \cdots, n]$
  while $C$ is not empty do
    $Node \leftarrow \emptyset$
    $Node \leftarrow Node \cup \{C[0]\}$
    Generate $y_{C[0]} \to x_{C[0]}$ for Client $C[0]$
    $Chains \leftarrow Chains \cup \{y_{C[0]} \to x_{C[0]}\}$
    for $i > 0$ do
      if $(\Delta_{C[0]}, \Delta_{C[0]C[i]}, \Delta_{C[i]}) \in \mathcal{R}$ then
        Generate $y_{C[0]} \to x_{C[0]} \to x_{C[i]}$ for Client $C[i]$
        $Chains \leftarrow Chains \cup \{y_{C[0]} \to x_{C[0]} \to x_{C[i]}\}$
        $Node \leftarrow Node \cup \{C[i]\}$
    Delete the clients in $Node$ from $C$
  return $Chains$
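A compact Python sketch of Algorithm 2 (0-indexed clients; `in_region` is a stand-in for the membership test of $\mathcal{R}$ in (23), a concrete version of which is sketched after Remark 3):

```python
def select_chains_special(delta_s, delta_c, in_region):
    """Chain selection for the special case (Algorithm 2), 0-indexed.

    delta_s:   list of Delta_i, assumed sorted in increasing order
    delta_c:   dict, delta_c[(t, i)] = Delta_ti for t < i
    in_region: callable (d_t, d_ti, d_i) -> bool, membership test for R
    Returns a dict client -> chain (list of client indices).
    """
    chains = {}
    remaining = list(range(len(delta_s)))
    while remaining:
        head = remaining[0]
        chains[head] = [head]                  # chain y_head -> x_head
        matched = {head}
        for i in remaining[1:]:
            if in_region(delta_s[head], delta_c[(head, i)], delta_s[i]):
                chains[i] = [head, i]          # chain y_head -> x_head -> x_i
                matched.add(i)
        remaining = [c for c in remaining if c not in matched]
    return chains
```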
Lemma 1.
For a given chain $\mathcal{L}_i : y_{i_1} \to x_{i_1} \to x_{i_2} \to \cdots \to x_{i_l}$, when using the quantizer $Q^{\mathcal{L}_i}_{\mathrm{Pro}}$ with the following parameters: $k \ge 4$, $\Delta'_{i_1} = \sqrt{(6\Delta_{i_1}^2/d)\ln\sqrt{n}}$, $\Delta'_{i_s i_{s+1}} = \sqrt{(6\Delta_{i_s i_{s+1}}^2/d)\ln\sqrt{n}}$ for $s \in [l-1]$, $\mu d = \lfloor r/\log k \rfloor$, and $\epsilon = \frac{2\Delta'_{i_1}}{k-2} + \sum_{s=1}^{l-1}\frac{2\Delta'_{i_s i_{s+1}}}{k-2}$, we have, for all $t \in [l]$ with $t > 1$,
$$\alpha_{i_t}\big(Q^{\mathcal{L}_{i_t}}_{\mathrm{Pro}}\big) \le \frac{24t\big(\Delta_{i_1} + \sum_{s=1}^{t-1}\Delta_{i_s i_{s+1}}\big)^2\ln\sqrt{n}}{\mu(k-2)^2} + \frac{c_t(n)\big(\Delta_{i_1} + \sum_{s=1}^{t-1}\Delta_{i_s i_{s+1}} + \Delta_{i_{t-1} i_t}\big)^2}{\mu n} + 3\alpha_{i_{t-1}}\big(Q^{\mathcal{L}_{i_{t-1}}}_{\mathrm{Pro}}\big) + \frac{(1-\mu)\Delta_{i_t}^2}{\mu},$$
$$\beta_{i_t}\big(Q^{\mathcal{L}_{i_t}}_{\mathrm{Pro}}\big) \le \frac{c_t(n)\big(\Delta_{i_1} + \sum_{s=1}^{t-1}\Delta_{i_s i_{s+1}} + \Delta_{i_{t-1} i_t}\big)^2}{n} + 3\alpha_{i_{t-1}}\big(Q^{\mathcal{L}_{i_{t-1}}}_{\mathrm{Pro}}\big),$$
and, for $t = 1$,
$$\alpha_{i_1}\big(Q^{\mathcal{L}_i}_{\mathrm{Pro}}\big) \le \frac{24\Delta_{i_1}^2\ln\sqrt{n}}{\mu(k-2)^2} + \frac{154\Delta_{i_1}^2}{\mu n} + \frac{(1-\mu)\Delta_{i_1}^2}{\mu}, \qquad \beta_{i_1}\big(Q^{\mathcal{L}_i}_{\mathrm{Pro}}\big) \le \frac{154\Delta_{i_1}^2}{n},$$
where $c_t(n) \triangleq \max\{576t^2/e,\; 3n + 36e^{-2/3}\}$.

Proof. See Appendix A.

By properly scaling and choosing an appropriate $k$, we obtain a more concise form in the following corollary, whose proof is given in Appendix B.
Corollary 1. If we set $\log k = \lceil\log(2 + \sqrt{12\ln n})\rceil$, then
$$\alpha_{i_l}\big(Q^{\mathcal{L}_{i_l}}_{\mathrm{Pro}}\big) \le \frac{24D_{ii_l}^2\ln\sqrt{n}}{\mu(k-2)^2} + \frac{154D_{ii_l}^2}{\mu n} + \frac{(1-\mu)\Delta_{i_l}^2}{\mu}, \qquad \beta_{i_l}\big(Q^{\mathcal{L}_{i_l}}_{\mathrm{Pro}}\big) \le \frac{154D_{ii_l}^2}{n},$$
where $D_{ii_l}$ satisfies $D_{ii_1} = \Delta_{i_1}$ for $l = 1$ and, for $l > 1$,
$$D_{ii_l}^2 = \max\Big\{l\Big(\Delta_{i_1} + \sum_{s=1}^{l-1}\Delta_{i_s i_{s+1}}\Big)^2,\; \frac{c_l(n)}{154}\Big(\Delta_{i_1} + \sum_{s=1}^{l-1}\Delta_{i_s i_{s+1}} + \Delta_{i_{l-1} i_l}\Big)^2 + \frac{3n}{154}D_{ii_{l-1}}^2\Big\} + 3D_{ii_{l-1}}^2. \qquad (21)$$
Theorem 2. For some fixed $(\Delta_s, \Delta_c)$, $d \ge r \ge \lceil\log(2 + \sqrt{12\ln n})\rceil$, and $\mu d = \lfloor r/\log k \rfloor$, the MSE is upper bounded by
$$\mathcal{E}_{\pi^*}(x, y) \le (79\log k + 26)\sum_{i=1}^{n}\frac{d\Delta_i^2}{n^2 r} + B\sum_{i=1}^{n}\frac{d\big(D_{ii_l}^2 - \Delta_i^2\big)}{n^2 r}, \qquad (22)$$
for all sets of chains $\{\mathcal{L}_i\}_{i=1}^{n}$, where
$$B = \begin{cases} 79\log k + 26, & \text{if } \sum_{i=1}^{n}\big(D_{ii_l}^2 - \Delta_i^2\big) \ge 0, \\ \frac{\log k}{8}, & \text{otherwise}, \end{cases}$$
$\log k = \lceil\log(2 + \sqrt{12\ln n})\rceil$, and $D_{ii_l}$ is given in (21).
Remark 1. We improve the upper bound on the MSE in (10) when $(\Delta_c, \Delta_s)$ lies in the region $\mathcal{R}_{\mathcal{L}} = \big\{(\Delta_c, \Delta_s) : \sum_{i=1}^{n} D_{ii_l}^2 < \sum_{i=1}^{n}\Delta_i^2\big\}$. In the region $\mathcal{R}_{\mathcal{L}}$, our new upper bound in (22) is $1 - \frac{\log k}{8(79\log k + 26)}\Big(1 - \frac{\sum_{i=1}^{n} D_{ii_l}^2}{\sum_{i=1}^{n}\Delta_i^2}\Big)$ times that in (10).
Remark 2. For each permutation $\Pi$ on $[n]$, since each Client $\Pi_i$ can choose $\mathcal{L}_{\Pi_i}$ from $i$ candidate chains, there are at most $n!$ different assignments $\{\mathcal{L}_{\Pi_i} : i \in [n]\}$. Thus, the number of all strategies will not exceed $(n!)^2$. Denote the chain set corresponding to each strategy by $\mathcal{L}^i$, $i \in [(n!)^2]$. From Theorem 2 and Remark 1, we obtain that when $(\Delta_c, \Delta_s) \in \bigcup_{i \in [(n!)^2]}\mathcal{R}_{\mathcal{L}^i}$, our upper bound is tighter than that in (10).
Remark 3. If Client $i$ uses the chain $y_t \to x_t \to x_i$, by (21) we have
$$D_{ii_2}^2 = \max\Big\{2(\Delta_t + \Delta_{ti})^2,\; \frac{c_2(n)}{154}(\Delta_t + 2\Delta_{ti})^2 + \frac{3n}{154}\Delta_t^2\Big\} + 3\Delta_t^2,$$
where $c_2(n) = \max\{2304/e,\; 3n + 36e^{-2/3}\}$. If Client $i$ uses the chain $y_i \to x_i$, by (21) we have $D_{ii_1} = \Delta_i$. Let
$$\mathcal{R} \triangleq \Big\{(\Delta_t, \Delta_{ti}, \Delta_i) : \max\Big\{2(\Delta_t + \Delta_{ti})^2 + 3\Delta_t^2,\; \frac{c_2(n)(\Delta_t + 2\Delta_{ti})^2 + 3n\Delta_t^2}{154} + 3\Delta_t^2\Big\} < \Delta_i^2\Big\}. \qquad (23)$$
Then, by Theorem 2, if there exist some $i, t \in [n]$ such that $(\Delta_t, \Delta_{ti}, \Delta_i) \in \mathcal{R}$, our estimator $Q_{\mathrm{Pro}}$ improves on the Wyner-Ziv estimator $Q_{\mathrm{WZ}}$ proposed in [16].
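A sketch of the membership test for $\mathcal{R}$, usable as the `in_region` argument of the Algorithm 2 sketch above; the constants mirror our reconstruction of (21) and (23) and should be treated as indicative:

```python
import math

def c2(n):
    # c_2(n) from Lemma 1 for chains of length 2.
    return max(2304 / math.e, 3 * n + 36 * math.e ** (-2 / 3))

def in_region(d_t, d_ti, d_i, n):
    """True if (d_t, d_ti, d_i) lies in R of (23), i.e., the length-2 chain
    y_t -> x_t -> x_i yields a smaller D^2 than the direct chain y_i -> x_i."""
    d2_chain = max(2 * (d_t + d_ti) ** 2,
                   (c2(n) * (d_t + 2 * d_ti) ** 2 + 3 * n * d_t ** 2) / 154)
    return d2_chain + 3 * d_t ** 2 < d_i ** 2

# Usage with the Algorithm 2 sketch:
#   select_chains_special(ds, dc, lambda a, b, c: in_region(a, b, c, n))
```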
Remark 4. From (23), we observe that to choose a chain of length larger than 1, there must exist at least one pair $(\Delta_t, \Delta_i)$ with $2\Delta_t < \Delta_i$, for some $t, i \in [n]$; otherwise, our estimator reduces to the Wyner-Ziv estimator in [16]. The condition $2\Delta_t < \Delta_i$ may seem a stringent assumption at first glance. In fact, since our quantizer is designed for specific vectors $x$ and $y$, the Euclidean distances $\|x_i - y_i\| \le \Delta_i$ and $\|x_j - y_j\| \le \Delta_j$, $i, j \in [n]$, can vary greatly. Also, one can spend additional bits on better estimating $\Delta_t$ and $\Delta_{ti}$ so that they are small enough to satisfy (23), and this additional bit cost can be neglected when $n$ is relatively large.
V. PROOF OF THEOREM 2

We first introduce the following lemma for a general $r$-bit quantizer.

Lemma 2 (see [16]). For $x$ and $y$ satisfying (6), and an $r$-bit quantizer $Q$ using independent randomness for different $i \in [n]$, the estimate $\hat{\bar{x}}$ in (9) and the sample mean $\bar{x}$ satisfy
$$\mathbb{E}\big[\|\hat{\bar{x}} - \bar{x}\|^2\big] \le \frac{1}{n^2}\sum_{i=1}^{n}\alpha_i(Q) + \Big(\frac{1}{n}\sum_{i=1}^{n}\sqrt{\beta_i(Q)}\Big)^2. \qquad (24)$$

By Lemma 2, we have
$$\mathcal{E}_{\pi^*}(x, y) \le \frac{1}{n^2}\sum_{i=1}^{n}\alpha_i\big(Q^{\mathcal{L}_i}_{\mathrm{Pro}}\big) + \Big(\frac{1}{n}\sum_{i=1}^{n}\sqrt{\beta_i\big(Q^{\mathcal{L}_i}_{\mathrm{Pro}}\big)}\Big)^2$$
$$\overset{(a)}{\le} \frac{r}{\mu d}\Big(\frac{24\ln\sqrt{n}}{(k-2)^2} + \frac{154}{n} + 154\mu\Big)\sum_{i=1}^{n}\frac{dD_{ii_l}^2}{n^2 r} + \frac{r(1-\mu)}{\mu d}\sum_{i=1}^{n}\frac{d\Delta_i^2}{n^2 r}$$
$$= \frac{r}{\mu d}\Big(\frac{24\ln\sqrt{n}}{(k-2)^2} + \frac{154}{n} + 1 - \mu + 154\mu\Big)\sum_{i=1}^{n}\frac{d\Delta_i^2}{n^2 r} + \frac{r}{\mu d}\Big(\frac{24\ln\sqrt{n}}{(k-2)^2} + \frac{154}{n} + 154\mu\Big)\sum_{i=1}^{n}\frac{d\big(D_{ii_l}^2 - \Delta_i^2\big)}{n^2 r}$$
$$\overset{(b)}{\le} \big(79\lceil\log(2 + \sqrt{12\ln n})\rceil + 26\big)\sum_{i=1}^{n}\frac{d\Delta_i^2}{n^2 r} + \underbrace{\frac{r}{\mu d}\Big(\frac{24\ln\sqrt{n}}{(k-2)^2} + \frac{154}{n} + 154\mu\Big)}_{B}\sum_{i=1}^{n}\frac{d\big(D_{ii_l}^2 - \Delta_i^2\big)}{n^2 r}, \qquad (25)$$
where (a) follows from Corollary 1, the Cauchy-Schwarz inequality, and $x_{i_l} = x_i$, and (b) follows from the inequality $\frac{r}{\mu d}\big(\frac{24\ln\sqrt{n}}{(k-2)^2} + \frac{154}{n} + 1 - \mu + 154\mu\big) \le 79\lceil\log(2 + \sqrt{12\ln n})\rceil + 26$ given in [16]. By this inequality, we obtain the upper bound
$$B \le \frac{r}{\mu d}\Big(\frac{24\ln\sqrt{n}}{(k-2)^2} + \frac{154}{n} + 1 - \mu + 154\mu\Big) \le 79\lceil\log(2 + \sqrt{12\ln n})\rceil + 26,$$
and the lower bound
$$B \ge \frac{r}{\mu d}\cdot\frac{24\ln\sqrt{n}}{(k-2)^2} \overset{(a)}{\ge} \frac{r}{\mu d}\cdot\frac{24\ln\sqrt{n}}{(2 + 2\sqrt{12\ln n})^2} \overset{(b)}{\ge} \frac{r}{8\mu d} \overset{(c)}{\ge} \frac{\lceil\log(2 + \sqrt{12\ln n})\rceil}{8},$$
where (a) holds by $\log k \le \log(2 + \sqrt{12\ln n}) + 1$, (b) holds because $\frac{24\ln\sqrt{n}}{(2+2\sqrt{12\ln n})^2} \ge \frac{24\ln\sqrt{2}}{(2+2\sqrt{12\ln 2})^2} \ge \frac{1}{8}$, and (c) holds by $\mu d = \lfloor r/\log k \rfloor \le r/\log k$.

VI. CONCLUSION
In this paper, we studied distributed mean estimation with limited communication. Inspired by Wyner-Ziv and Slepian-Wolf coding, we proposed a new estimator that jointly exploits the side information at the server and the correlation between clients' data. In future work, we aim to find a more efficient estimator and apply it to more general distributed optimization frameworks.
REFERENCES

[1] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings et al., "Advances and open problems in federated learning," arXiv preprint arXiv:1912.04977, 2019.
[2] F. Seide, H. Fu, J. Droppo, G. Li, and D. Yu, "1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs," in Fifteenth Annual Conference of the International Speech Communication Association, 2014.
[3] H. Wang, S. Sievert, S. Liu, Z. Charles, D. Papailiopoulos, and S. Wright, "ATOMO: Communication-efficient learning via atomic sparsification," Advances in Neural Information Processing Systems, vol. 31, pp. 9850–9861, 2018.
[4] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
[5] D. Alistarh, T. Hoefler, M. Johansson, N. Konstantinov, S. Khirirat, and C. Renggli, "The convergence of sparsified gradient methods," in Advances in Neural Information Processing Systems, 2018, pp. 5973–5983.
[6] S. U. Stich, J.-B. Cordonnier, and M. Jaggi, "Sparsified SGD with memory," in Advances in Neural Information Processing Systems, 2018, pp. 4447–4458.
[7] W. Wen, C. Xu, F. Yan, C. Wu, Y. Wang, Y. Chen, and H. Li, "TernGrad: Ternary gradients to reduce communication in distributed deep learning," in Advances in Neural Information Processing Systems, 2017, pp. 1509–1519.
[8] J. Wangni, J. Wang, J. Liu, and T. Zhang, "Gradient sparsification for communication-efficient distributed optimization," Advances in Neural Information Processing Systems, vol. 31, pp. 1299–1309, 2018.
[9] A. T. Suresh, X. Y. Felix, S. Kumar, and H. B. McMahan, "Distributed mean estimation with limited communication," in International Conference on Machine Learning. PMLR, 2017, pp. 3329–3337.
[10] J. Konečný and P. Richtárik, "Randomized distributed mean estimation: Accuracy vs. communication," Frontiers in Applied Mathematics and Statistics, vol. 4, p. 62, 2018.
[11] W.-N. Chen, P. Kairouz, and A. Özgür, "Breaking the communication-privacy-accuracy trilemma," arXiv preprint arXiv:2007.11707, 2020.
[12] Z. Huang, W. Yilei, K. Yi et al., "Optimal sparsity-sensitive bounds for distributed mean estimation," in Advances in Neural Information Processing Systems, 2019, pp. 6371–6381.
[13] P. Mayekar and H. Tyagi, "RATQ: A universal fixed-length quantizer for stochastic optimization," in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 1399–1409.
[14] M. Safaryan, E. Shulgin, and P. Richtárik, "Uncertainty principle for communication compression in distributed and federated learning and the search for an optimal compressor," arXiv preprint arXiv:2002.08958, 2020.
[15] A. Albasyoni, M. Safaryan, L. Condat, and P. Richtárik, "Optimal gradient compression for distributed and federated learning," arXiv preprint arXiv:2010.03246, 2020.
[16] P. Mayekar, A. T. Suresh, and H. Tyagi, "Wyner-Ziv estimators: Efficient distributed mean estimation with side information," arXiv preprint arXiv:2011.12160, 2020.
[17] A. Wyner and J. Ziv, "The rate-distortion function for source coding with side information at the decoder," IEEE Transactions on Information Theory, vol. 22, no. 1, pp. 1–10, 1976.
[18] S. S. Pradhan and K. Ramchandran, "Distributed source coding using syndromes (DISCUS): Design and construction," IEEE Transactions on Information Theory, vol. 49, no. 3, pp. 626–643, 2003.
[19] R. Zamir, S. Shamai, and U. Erez, "Nested linear/lattice codes for structured multiterminal binning," IEEE Transactions on Information Theory, vol. 48, no. 6, pp. 1250–1276, 2002.
[20] S. H. Lim, C. Feng, A. Pastore, B. Nazer, and M. Gastpar, "Towards an algebraic network information theory: Distributed lossy computation of linear functions," in 2019 IEEE International Symposium on Information Theory (ISIT). IEEE, 2019, pp. 1827–1831.
[21] K. J. Horadam, Hadamard Matrices and Their Applications. Princeton University Press, 2012.

APPENDIX
A. Proof of Lemma 1
Our quantizer is based on $Q_{M,R}$, similar to $Q_{\mathrm{WZ}}$. We first introduce the following lemma, whose proof is similar to that in [16]. With a slight abuse of notation, we write $Q^{\mathcal{L}_{i_l}}_{\mathrm{Pro}}$ as $Q_{\mathrm{Pro}}$.

Lemma 3. Fix $\Delta_{i_l} > 0$. Then, for $\mu d \in [d]$, we have
$$\alpha_{i_l}(Q_{\mathrm{Pro}}) \le \frac{\alpha_{i_l}(Q_{M,R})}{\mu} + \frac{(1-\mu)\Delta_{i_l}^2}{\mu}; \qquad \beta_{i_l}(Q_{\mathrm{Pro}}) = \beta_{i_l}(Q_{M,R}).$$

Proof.
$$\mathbb{E}\big[\|\hat{x}_{Q_{\mathrm{Pro}},i_l} - x_{i_l}\|^2\big] = \sum_{j\in[d]}\mathbb{E}\Big[\Big(\tfrac{1}{\mu}\big(R\hat{x}_{Q_{M,R},i_l}(j) - Ry_{i_l}(j)\big)\mathbb{1}_{\{j\in S\}} - \big(Rx_{i_l}(j) - Ry_{i_l}(j)\big)\Big)^2\Big]$$
$$= \sum_{j\in[d]}\mathbb{E}\Big[\Big(\tfrac{1}{\mu}\big(R\hat{x}_{Q_{M,R},i_l}(j) - Rx_{i_l}(j)\big)\Big)^2\mathbb{1}_{\{j\in S\}}\Big] + \sum_{j\in[d]}\mathbb{E}\Big[\Big(\tfrac{1}{\mu}\big(Rx_{i_l}(j) - Ry_{i_l}(j)\big)\mathbb{1}_{\{j\in S\}} - \big(Rx_{i_l}(j) - Ry_{i_l}(j)\big)\Big)^2\Big]$$
$$= \frac{1}{\mu}\sum_{j\in[d]}\mathbb{E}\big[\big(R\hat{x}_{Q_{M,R},i_l}(j) - Rx_{i_l}(j)\big)^2\big] + \sum_{j\in[d]}\mathbb{E}\big[\big(Rx_{i_l}(j) - Ry_{i_l}(j)\big)^2\big]\cdot\mathbb{E}\Big[\Big(\tfrac{1}{\mu}\mathbb{1}_{\{j\in S\}} - 1\Big)^2\Big]$$
$$= \frac{1}{\mu}\sum_{j\in[d]}\mathbb{E}\big[\big(R\hat{x}_{Q_{M,R},i_l}(j) - Rx_{i_l}(j)\big)^2\big] + \sum_{j\in[d]}\mathbb{E}\big[\big(Rx_{i_l}(j) - Ry_{i_l}(j)\big)^2\big]\cdot\frac{1-\mu}{\mu} \le \frac{\alpha_{i_l}(Q_{M,R})}{\mu} + \frac{(1-\mu)\Delta_{i_l}^2}{\mu},$$
where we use the independence of $S$ and $R$ in the third identity and the fact that $R$ is unitary in the final step. Since
$$\big\|\mathbb{E}[\hat{x}_{Q_{\mathrm{Pro}},i_l}] - x_{i_l}\big\|^2 = \Big\|\sum_{j\in[d]}\mathbb{E}\Big[\tfrac{1}{\mu}\big(R\hat{x}_{Q_{M,R},i_l}(j) - Ry_{i_l}(j)\big)\mathbb{1}_{\{j\in S\}} - \big(Rx_{i_l}(j) - Ry_{i_l}(j)\big)\Big]e_j\Big\|^2 = \big\|\mathbb{E}[\hat{x}_{Q_{M,R},i_l}] - x_{i_l}\big\|^2,$$
where we use the independence of $S$ and $Q_{M,R}$ in the last identity, we have $\beta_{i_l}(Q_{\mathrm{Pro}}) = \beta_{i_l}(Q_{M,R})$.

The following lemma is given in [16]; it shows that $Q_M$ is unbiased under certain conditions and that its error does not exceed $\epsilon$.

Lemma 4 (see [16]). Consider $Q_M$ described in Section III-1 with parameter $\epsilon$ set to satisfy
$$k\epsilon \ge 2(\epsilon + \Delta'). \qquad (26)$$
Then, for every $x, h \in \mathbb{R}$ such that $|x - h| \le \Delta'$, the output $Q_M(x)$ satisfies
$$\mathbb{E}[Q_M(x)] = x, \qquad |x - Q_M(x)| < \epsilon.$$

Recall from Section IV-A that, for a chain $y_{i_1} \to x_{i_1} \to \cdots \to x_{i_l}$, the server estimates $x_{i_l}$ by using $\hat{x}_{Q^{\mathcal{L}_i}_{\mathrm{Pro}}, i_{l-1}}$ as $h$ and $\Delta'_{i_1} + \sum_{s=1}^{l-1}\Delta'_{i_s i_{s+1}}$ as the parameter $\Delta'$ in $Q_{M,R}$. By Lemma 3, it suffices to consider the quantizer $Q_{M,R}$.

Lemma 5. Suppose $R$ given in (12) satisfies, for $j \in [d]$ and $s \in [l-1]$,
$$|Rx_{i_1}(j) - Ry_{i_1}(j)| \le \Delta'_{i_1}, \qquad |Rx_{i_s}(j) - Rx_{i_{s+1}}(j)| \le \Delta'_{i_s i_{s+1}}.$$
Then, for $k \ge 4$ and $t > 1$, we have
$$|Rx_{i_t}(j) - R\hat{x}_{Q_{M,R},i_{t-1}}(j)| \le \Delta'_{i_1} + \sum_{s=1}^{t-1}\Delta'_{i_s i_{s+1}},$$
where for $t = 1$ we denote $\hat{x}_{Q_{M,R},i_0} \triangleq y_{i_1}$.

Proof. We use induction on $t$. If $t = 2$: since $|Rx_{i_1}(j) - Ry_{i_1}(j)| \le \Delta'_{i_1}$, by Lemma 4 we have
$$|Rx_{i_1}(j) - R\hat{x}_{Q_{M,R},i_1}(j)| \le \frac{2\Delta'_{i_1}}{k-2},$$
where we set $\epsilon_{i_1} = \frac{2\Delta'_{i_1}}{k-2}$. Then, for $k \ge 4$,
$$|Rx_{i_2}(j) - R\hat{x}_{Q_{M,R},i_1}(j)| \le |Rx_{i_2}(j) - Rx_{i_1}(j)| + |Rx_{i_1}(j) - R\hat{x}_{Q_{M,R},i_1}(j)| \le \Delta'_{i_1 i_2} + \frac{2\Delta'_{i_1}}{k-2} \le \Delta'_{i_1 i_2} + \Delta'_{i_1}.$$
Suppose now
$$|Rx_{i_{t-1}}(j) - R\hat{x}_{Q_{M,R},i_{t-2}}(j)| \le \Delta'_{i_1} + \sum_{s=1}^{t-2}\Delta'_{i_s i_{s+1}}.$$
By Lemma 4, we have
$$|Rx_{i_{t-1}}(j) - R\hat{x}_{Q_{M,R},i_{t-1}}(j)| \le \frac{2\big(\Delta'_{i_1} + \sum_{s=1}^{t-2}\Delta'_{i_s i_{s+1}}\big)}{k-2}.$$
So, for $k \ge 4$,
$$|Rx_{i_t}(j) - R\hat{x}_{Q_{M,R},i_{t-1}}(j)| \le |Rx_{i_t}(j) - Rx_{i_{t-1}}(j)| + |Rx_{i_{t-1}}(j) - R\hat{x}_{Q_{M,R},i_{t-1}}(j)| \le \Delta'_{i_{t-1} i_t} + \frac{2\big(\Delta'_{i_1} + \sum_{s=1}^{t-2}\Delta'_{i_s i_{s+1}}\big)}{k-2} \le \Delta'_{i_1} + \sum_{s=1}^{t-1}\Delta'_{i_s i_{s+1}}.$$
For convenience, let $A_1$ denote the event $\{R : |Rx_{i_1}(j) - Ry_{i_1}(j)| \le \Delta'_{i_1}\}$ and, for $s \ge 2$, let $A_s$ denote the event $\{R : |Rx_{i_{s-1}}(j) - Rx_{i_s}(j)| \le \Delta'_{i_{s-1} i_s}\}$. By Lemma 5, if $R \in \bigcap_{s\in[t]}A_s$, then $|Rx_{i_t}(j) - R\hat{x}_{Q_{M,R},i_{t-1}}(j)| \le \Delta'_{i_1} + \sum_{s=1}^{t-1}\Delta'_{i_s i_{s+1}}$. Thus,
$$|Rx_{i_t}(j) - R\hat{x}_{Q_{M,R},i_t}(j)| \le \epsilon_{i_1} + \sum_{s=1}^{t-1}\epsilon_{i_s i_{s+1}} = \frac{2\big(\Delta'_{i_1} + \sum_{s=1}^{t-1}\Delta'_{i_s i_{s+1}}\big)}{k-2}, \qquad (27)$$
where we set $\epsilon_{i_1} = \frac{2\Delta'_{i_1}}{k-2}$ and $\epsilon_{i_s i_{s+1}} = \frac{2\Delta'_{i_s i_{s+1}}}{k-2}$.

For the random matrix $R$ given in (12) and every $z \in \mathbb{R}^d$, the random variables $Rz(i)$, $i \in [d]$, are sub-Gaussian with variance factor $\|z\|^2/d$. Furthermore, we need the following bound.

Lemma 6 (see [16]). For a sub-Gaussian random variable $Z$ with variance factor $\sigma^2$ and every $t \ge 0$, we have
$$\mathbb{E}\big[Z^2\mathbb{1}_{\{|Z|>t\}}\big] \le (2\sigma^2 + t^2)e^{-t^2/(2\sigma^2)}.$$

We now handle $\alpha_{i_t}(Q_{M,R})$ and $\beta_{i_t}(Q_{M,R})$ separately. First, consider $t > 1$. Since $R$ is a unitary transform, we have
$$\mathbb{E}\big[\|\hat{x}_{Q_{M,R},i_t} - x_{i_t}\|^2\big] = \sum_{j=1}^{d}\mathbb{E}\big[(R\hat{x}_{Q_{M,R},i_t}(j) - Rx_{i_t}(j))^2\big] = \sum_{j=1}^{d}\mathbb{E}\big[(R\hat{x}_{Q_{M,R},i_t}(j) - Rx_{i_t}(j))^2\mathbb{1}_{\bigcap_{s\in[t]}A_s}\big] + \sum_{j=1}^{d}\mathbb{E}\big[(R\hat{x}_{Q_{M,R},i_t}(j) - Rx_{i_t}(j))^2\mathbb{1}_{\bigcup_{s\in[t]}A_s^c}\big]. \qquad (28)$$
For the first term, by (27) we have
$$\sum_{j=1}^{d}\mathbb{E}\big[(R\hat{x}_{Q_{M,R},i_t}(j) - Rx_{i_t}(j))^2\mathbb{1}_{\bigcap_{s\in[t]}A_s}\big] \le d\Big(\epsilon_{i_1} + \sum_{s=1}^{t-1}\epsilon_{i_s i_{s+1}}\Big)^2 \le dt\Big(\epsilon_{i_1}^2 + \sum_{s=1}^{t-1}\epsilon_{i_s i_{s+1}}^2\Big), \qquad (29)$$
where in the final step we use $(a_1 + \cdots + a_t)^2 \le t(a_1^2 + \cdots + a_t^2)$.

For the second term in (28), writing $\hat{x}_{i_t}$ for $\hat{x}_{Q_{M,R},i_t}$ and using $(a+b+c)^2 \le 3(a^2+b^2+c^2)$, we get
$$\sum_{j=1}^{d}\mathbb{E}\big[(R\hat{x}_{i_t}(j) - Rx_{i_t}(j))^2\mathbb{1}_{\bigcup A_s^c}\big] \le 3\sum_{j=1}^{d}\Big(\mathbb{E}\big[(R\hat{x}_{i_t}(j) - R\hat{x}_{i_{t-1}}(j))^2\mathbb{1}_{\bigcup A_s^c}\big] + \mathbb{E}\big[(R\hat{x}_{i_{t-1}}(j) - Rx_{i_{t-1}}(j))^2\mathbb{1}_{\bigcup A_s^c}\big] + \mathbb{E}\big[(Rx_{i_{t-1}}(j) - Rx_{i_t}(j))^2\mathbb{1}_{\bigcup A_s^c}\big]\Big)$$
$$\le 3k^2\Big(\epsilon_{i_1} + \sum_{s=1}^{t-1}\epsilon_{i_s i_{s+1}}\Big)^2\sum_{j=1}^{d}\mathbb{P}\Big(\bigcup_{s\in[t]}A_s^c\Big) + 3\Delta_{i_{t-1}i_t}^2 + 3\sum_{j=1}^{d}\mathbb{E}\big[(Rx_{i_{t-1}}(j) - Rx_{i_t}(j))^2\mathbb{1}_{A_t^c}\big] + 3\alpha_{i_{t-1}}(Q_{M,R}), \qquad (30)$$
where we use the fact that the decoded point lies within the coset spacing $k(\epsilon_{i_1} + \sum_{s}\epsilon_{i_s i_{s+1}})$ of $h$. Since
$$\mathbb{P}\Big(\bigcup_{s\in[t]}A_s^c\Big) \le \sum_{s=1}^{t}\mathbb{P}(A_s^c) \le 2e^{-d\Delta_{i_1}'^2/(2\Delta_{i_1}^2)} + 2\sum_{s=2}^{t}e^{-d\Delta_{i_{s-1}i_s}'^2/(2\Delta_{i_{s-1}i_s}^2)} = 2t(\sqrt{n})^{-3}, \qquad (31)$$
where in the final step we use $\Delta'_{i_1} = \sqrt{(6\Delta_{i_1}^2/d)\ln\sqrt{n}}$ and $\Delta'_{i_{s-1}i_s} = \sqrt{(6\Delta_{i_{s-1}i_s}^2/d)\ln\sqrt{n}}$, we get
$$\sum_{j=1}^{d}\mathbb{E}\big[(R\hat{x}_{i_t}(j) - Rx_{i_t}(j))^2\mathbb{1}_{\bigcup A_s^c}\big] \overset{(a)}{\le} 6tdk^2\Big(\epsilon_{i_1} + \sum_{s=1}^{t-1}\epsilon_{i_s i_{s+1}}\Big)^2(\sqrt{n})^{-3} + 3\Delta_{i_{t-1}i_t}^2 + 12\Delta_{i_{t-1}i_t}^2\big(1 + 3\ln\sqrt{n}\big)(\sqrt{n})^{-3} + 3\alpha_{i_{t-1}}(Q_{M,R}), \qquad (32)$$
where in (a) we use Lemma 6. Combining (28), (29), and (32), we obtain
$$\mathbb{E}\big[\|\hat{x}_{Q_{M,R},i_t} - x_{i_t}\|^2\big] \overset{(a)}{\le} \frac{24tA_{i_t}^2\ln\sqrt{n}}{(k-2)^2} + \frac{144t^2k^2A_{i_t}^2\ln\sqrt{n}}{(k-2)^2\, n\sqrt{n}} + 3\Delta_{i_{t-1}i_t}^2 + 12\Delta_{i_{t-1}i_t}^2(1 + 3\ln\sqrt{n})(\sqrt{n})^{-3} + 3\alpha_{i_{t-1}}(Q_{M,R})$$
$$\overset{(b)}{\le} \frac{24tA_{i_t}^2\ln\sqrt{n}}{(k-2)^2} + \frac{576t^2A_{i_t}^2}{en} + 3\Delta_{i_{t-1}i_t}^2 + \frac{36\Delta_{i_{t-1}i_t}^2}{e^{2/3}n} + 3\alpha_{i_{t-1}}(Q_{M,R}) \le \frac{24tA_{i_t}^2\ln\sqrt{n}}{(k-2)^2} + \frac{c_t(n)\big(A_{i_t} + \Delta_{i_{t-1}i_t}\big)^2}{n} + 3\alpha_{i_{t-1}}(Q_{M,R}), \qquad (33)$$
where in (a) we use $\epsilon_{i_1} = \frac{2\Delta'_{i_1}}{k-2}$ and $\epsilon_{i_s i_{s+1}} = \frac{2\Delta'_{i_s i_{s+1}}}{k-2}$, in (b) we use $\frac{\ln\sqrt{n}}{\sqrt{n}} \le \frac{1}{e}$ and $\frac{1 + 3\ln\sqrt{n}}{\sqrt{n}} \le \frac{3}{e^{2/3}}$, and $A_{i_t} \triangleq \Delta_{i_1} + \sum_{s=1}^{t-1}\Delta_{i_s i_{s+1}}$.

Now we consider $\beta_{i_t}(Q_{M,R})$. We have
$$\big\|\mathbb{E}[\hat{x}_{Q_{M,R},i_t}] - x_{i_t}\big\|^2 = \big\|\mathbb{E}[R\hat{x}_{Q_{M,R},i_t}] - Rx_{i_t}\big\|^2 \overset{(a)}{=} \Big\|\mathbb{E}\big[(R\hat{x}_{Q_{M,R},i_t} - Rx_{i_t})\mathbb{1}_{\bigcup_{s\in[t]}A_s^c}\big]\Big\|^2 \le \sum_{j=1}^{d}\mathbb{E}\big[(R\hat{x}_{Q_{M,R},i_t}(j) - Rx_{i_t}(j))^2\mathbb{1}_{\bigcup_{s\in[t]}A_s^c}\big] \le \frac{c_t(n)\big(A_{i_t} + \Delta_{i_{t-1}i_t}\big)^2}{n} + 3\alpha_{i_{t-1}}(Q_{M,R}), \qquad (34)$$
where (a) holds because, on $\bigcap_{s\in[t]}A_s$, $\hat{x}_{Q_{M,R},i_t}$ is an unbiased estimate of $x_{i_t}$ by Lemma 4, and the subsequent inequality uses the Cauchy-Schwarz inequality.

For $t = 1$, following a similar method we have
$$\mathbb{E}\big[\|\hat{x}_{Q_{M,R},i_1} - x_{i_1}\|^2\big] \le d\epsilon_{i_1}^2 + 4dk^2\epsilon_{i_1}^2(\sqrt{n})^{-3} + 4\big(2\Delta_{i_1}^2 + d\Delta_{i_1}'^2\big)(\sqrt{n})^{-3} \le \frac{24\Delta_{i_1}^2\ln\sqrt{n}}{(k-2)^2} + \Big(\frac{96k^2}{e(k-2)^2} + \frac{24}{e^{2/3}}\Big)\frac{\Delta_{i_1}^2}{n} \le \frac{24\Delta_{i_1}^2\ln\sqrt{n}}{(k-2)^2} + \frac{154\Delta_{i_1}^2}{n}. \qquad (35)$$
Similarly, for $\beta_{i_1}(Q_{M,R})$ we have $\|\mathbb{E}[\hat{x}_{Q_{M,R},i_1}] - x_{i_1}\|^2 \le \frac{154\Delta_{i_1}^2}{n}$. We complete the proof of the lemma by applying Lemma 3.
B. Proof of Corollary 1
We first consider $\alpha_{i_l}(Q_{M,R})$ and $\beta_{i_l}(Q_{M,R})$, and prove that
$$\alpha_{i_l}(Q_{M,R}) \le \frac{24D_{ii_l}^2\ln\sqrt{n}}{(k-2)^2} + \frac{154D_{ii_l}^2}{n}, \qquad \beta_{i_l}(Q_{M,R}) \le \frac{154D_{ii_l}^2}{n}.$$
We use induction on $l$. For $l = 1$, the claim is immediate from (35). Suppose now that
$$\alpha_{i_{l-1}}(Q_{M,R}) \le \frac{24D_{ii_{l-1}}^2\ln\sqrt{n}}{(k-2)^2} + \frac{154D_{ii_{l-1}}^2}{n}, \qquad \beta_{i_{l-1}}(Q_{M,R}) \le \frac{154D_{ii_{l-1}}^2}{n}.$$
By (33) and (34), we have
$$\alpha_{i_l}(Q_{M,R}) \le \frac{24lA_{i_l}^2\ln\sqrt{n}}{(k-2)^2} + \frac{c_l(n)\big(A_{i_l} + \Delta_{i_{l-1}i_l}\big)^2}{n} + 3\alpha_{i_{l-1}}(Q_{M,R}) \le \frac{24\big(lA_{i_l}^2 + 3D_{ii_{l-1}}^2\big)\ln\sqrt{n}}{(k-2)^2} + \Big(c_l(n)\big(A_{i_l} + \Delta_{i_{l-1}i_l}\big)^2 + 462D_{ii_{l-1}}^2\Big)\cdot\frac{1}{n}$$
and
$$\beta_{i_l}(Q_{M,R}) \le \frac{c_l(n)\big(A_{i_l} + \Delta_{i_{l-1}i_l}\big)^2}{n} + 3\alpha_{i_{l-1}}(Q_{M,R}) \le 3D_{ii_{l-1}}^2\cdot\frac{24\ln\sqrt{n}}{(k-2)^2} + \Big(c_l(n)\big(A_{i_l} + \Delta_{i_{l-1}i_l}\big)^2 + 462D_{ii_{l-1}}^2\Big)\cdot\frac{1}{n}$$
$$\le \Big(3nD_{ii_{l-1}}^2 + c_l(n)\big(A_{i_l} + \Delta_{i_{l-1}i_l}\big)^2 + 462D_{ii_{l-1}}^2\Big)\cdot\frac{1}{n},$$
where in the last step we use $\frac{24\ln\sqrt{n}}{(k-2)^2} \le 1$, which holds since $\log k = \lceil\log(2 + \sqrt{12\ln n})\rceil$ implies $(k-2)^2 \ge 12\ln n = 24\ln\sqrt{n}$. Since
$$D_{ii_l}^2 = \max\Big\{l\Big(\Delta_{i_1} + \sum_{s=1}^{l-1}\Delta_{i_si_{s+1}}\Big)^2,\; \frac{c_l(n)}{154}\Big(\Delta_{i_1} + \sum_{s=1}^{l-1}\Delta_{i_si_{s+1}} + \Delta_{i_{l-1}i_l}\Big)^2 + \frac{3n}{154}D_{ii_{l-1}}^2\Big\} + 3D_{ii_{l-1}}^2, \qquad (36)$$
both right-hand sides are bounded by $\frac{24D_{ii_l}^2\ln\sqrt{n}}{(k-2)^2} + \frac{154D_{ii_l}^2}{n}$ and $\frac{154D_{ii_l}^2}{n}$, respectively, which completes the induction. The corollary then follows by applying Lemma 3.