Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning
Divyansh Jhunjhunwala⋆, Advait Gadhikar⋆, Gauri Joshi⋆, Yonina C. Eldar†
⋆ Carnegie Mellon University, Pittsburgh, USA, {djhunjhu, agadhika, gaurij}@andrew.cmu.edu
† Weizmann Institute of Science, Rehovot, Israel, [email protected]

ABSTRACT
Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning, especially in bandwidth-limited settings and for high-dimensional models. Gradient quantization is an effective way of reducing the number of bits required to communicate each model update, albeit at the cost of a higher error floor due to the increased variance of the stochastic gradients. In this work, we propose an adaptive quantization strategy called AdaQuantFL that aims to achieve communication efficiency as well as a low error floor by changing the number of quantization levels during the course of training. Experiments on training deep neural networks show that our method can converge using far fewer communicated bits than fixed-quantization-level setups, with little or no impact on training and test accuracy.
Index Terms — distributed optimization, federated learning, adaptive quantization
1. INTRODUCTION
Distributed machine learning training, which was typically done in the data center setting, is rapidly transitioning to the Federated Learning (FL) setting [1, 2], where data is spread across a large number of mobile client devices. Due to privacy concerns, the FL clients perform on-device training and only share model updates with a central server. A major challenge in FL is the communication bottleneck due to the limited uplink bandwidth available to the clients.

Recent work tackling this problem has taken two major directions. The first approach reduces the load on the communication channel by allowing each client to perform multiple local updates [1, 3, 4, 5, 6], thus reducing the communication frequency between clients and server. However, this optimization may not be enough due to the large size of model updates for high-dimensional models, like neural networks. The second approach deals with this problem by using compression methods to reduce the size of the model update being communicated by the clients at an update step [7, 8, 9, 10, 11, 12, 13]. However, such compression methods usually add to the error floor of the training objective as they increase the variance of the updates. Thus, one needs to carefully choose the number of quantization levels in order to strike the best error-communication trade-off.

In this work we propose AdaQuantFL, a strategy to automatically adapt the number of quantization levels used to represent a model update and achieve a low error floor as well as communication efficiency. The key idea behind our approach is that we bound the convergence of the training error in terms of the number of bits communicated, unlike traditional approaches which bound the error with respect to the number of training rounds (see Fig. 1). We use this convergence analysis to adapt the number of quantization levels during training based on the current training loss. Our approach can be considered orthogonal to other proposed methods of adaptive compression such as varying the spacing between quantization levels [14] and reusing outdated gradients [15]. In [16], the authors propose an adaptive method for tuning the number of local updates, or the communication frequency. AdaQuantFL is a similar strategy, but for tuning the number of bits communicated per round. Our experiments on distributed training of deep neural networks verify that AdaQuantFL is able to achieve a given target training loss using far fewer bits compared to fixed quantization methods.
2. SYSTEM MODEL
Consider a system of $n$ clients and a central aggregating server. Each client $i$ has a dataset $\mathcal{D}_i$ of size $m_i$ consisting of labeled samples $\xi_j^{(i)} = (x_j^{(i)}, y_j^{(i)})$ for $j = 1, \ldots, m_i$. The goal is to train a common global model, represented by the parameter vector $\mathbf{w} \in \mathbb{R}^d$, by minimizing the following objective function:
$$\min_{\mathbf{w} \in \mathbb{R}^d} f(\mathbf{w}) = \sum_{i=1}^{n} p_i f_i(\mathbf{w}) = \sum_{i=1}^{n} \frac{p_i}{m_i} \sum_{j=1}^{m_i} \ell(\mathbf{w}; \xi_j^{(i)}), \qquad (1)$$
where $p_i = m_i / \sum_{i'=1}^{n} m_{i'}$ is the fraction of data held at the $i$-th client and $f_i(\mathbf{w})$ is the empirical risk at the $i$-th client for a possibly non-convex loss function $\ell(\mathbf{w}; \xi_j^{(i)})$.
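To make the weighting in (1) concrete, the following is a minimal sketch of evaluating the global objective as the $p_i$-weighted sum of client empirical risks; the function and variable names (`global_objective`, `client_losses`) are illustrative and not part of any released code.

```python
import numpy as np

def global_objective(w, client_losses, client_sizes):
    """Evaluate f(w) = sum_i p_i f_i(w) with p_i = m_i / sum_{i'} m_{i'}.

    client_losses: list of callables, client_losses[i](w) returns the
    empirical risk f_i(w) of client i; client_sizes: list of m_i.
    """
    sizes = np.asarray(client_sizes, dtype=float)
    p = sizes / sizes.sum()  # data fractions p_i
    return sum(p_i * f_i(w) for p_i, f_i in zip(p, client_losses))
```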
Fig. 1: Viewing training in terms of bits communicated. The left plot shows training loss versus the number of training rounds and the right plot shows training loss versus the number of bits communicated, for small and large $s$; AdaQuantFL adapts $s_k^*$ over communication intervals.
Quantized Local SGD.
The model is trained iteratively using the local stochastic gradient descent (local SGD) algorithm proposed in [6, 3]. In local SGD, the entire training process is divided into rounds consisting of $\tau$ local updates at each client. At the beginning of the $k$-th round, each client reads the current global model $\mathbf{w}_k$ from the central server and updates it by performing $\tau$ local SGD steps for $t = 0, \ldots, \tau - 1$ as follows:
$$\mathbf{w}_{k,t+1}^{(i)} = \mathbf{w}_{k,t}^{(i)} - \eta\, g_i(\mathbf{w}_{k,t}^{(i)}, \xi^{(i)}), \qquad (2)$$
where $\mathbf{w}_{k,0}^{(i)} = \mathbf{w}_k$ and $g_i(\mathbf{w}_{k,t}^{(i)}, \xi^{(i)})$ is the stochastic gradient computed using a mini-batch $\xi^{(i)}$ sampled uniformly at random from the $i$-th client's local dataset $\mathcal{D}_i$. After completing $\tau$ steps of local SGD, each client sends its update for the $k$-th round, denoted by $\Delta \mathbf{w}_k^{(i)} = \mathbf{w}_{k,\tau}^{(i)} - \mathbf{w}_{k,0}^{(i)}$, to the central server. In order to save on bits communicated over the bandwidth-limited uplink channel, each client only sends a quantized update $Q(\Delta \mathbf{w}_k^{(i)})$, where $Q(\cdot)$ represents a stochastic quantization operator over $\mathbb{R}^d$. Once the server has received the quantized updates from all the clients, the global model is updated as follows:
$$\mathbf{w}_{k+1} = \mathbf{w}_k + \sum_{i=1}^{n} p_i Q(\Delta \mathbf{w}_k^{(i)}). \qquad (3)$$
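A minimal sketch of one round of quantized local SGD as described in (2)-(3) is given below. It assumes each client exposes a `stochastic_grad(w)` method returning $g_i(\mathbf{w}, \xi^{(i)})$ for a freshly sampled mini-batch; the names are illustrative, not the authors' implementation.

```python
import numpy as np

def run_round(w_global, clients, p, tau, eta, quantize):
    """One round of quantized local SGD following (2)-(3).

    clients: list of client objects with a stochastic_grad(w) method;
    p: array of data fractions p_i; quantize: stochastic quantization
    operator Q(.) applied to each client's model update.
    """
    aggregated = np.zeros_like(w_global)
    for p_i, client in zip(p, clients):
        w = w_global.copy()
        for _ in range(tau):                     # tau local SGD steps, (2)
            w -= eta * client.stochastic_grad(w)
        delta = w - w_global                     # model update Delta w_k^(i)
        aggregated += p_i * quantize(delta)      # server aggregation, (3)
    return w_global + aggregated
```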
Stochastic Uniform Quantizer. In this work we consider the commonly used [7, 17, 18] stochastic uniform quantization operator $Q_s(\mathbf{w})$, which is parameterized by the number of quantization levels $s \in \mathbb{N} = \{1, 2, \ldots\}$. For each dimension of a $d$-dimensional parameter vector $\mathbf{w} = [w_1, \ldots, w_d]$,
$$Q_s(w_i) = \|\mathbf{w}\|_2 \,\mathrm{sign}(w_i)\, \zeta_i(\mathbf{w}, s), \qquad (4)$$
where $\zeta_i(\mathbf{w}, s)$ is a random variable given by
$$\zeta_i(\mathbf{w}, s) = \begin{cases} \frac{l+1}{s} & \text{with probability } \frac{|w_i|}{\|\mathbf{w}\|_2}s - l, \\ \frac{l}{s} & \text{otherwise}. \end{cases} \qquad (5)$$
Here, $l \in \{0, 1, 2, \ldots, s-1\}$ is an integer such that $\frac{|w_i|}{\|\mathbf{w}\|_2} \in [\frac{l}{s}, \frac{l+1}{s})$. For $\mathbf{w} = \mathbf{0}$, we define $Q_s(\mathbf{w}) = \mathbf{0}$. Given $Q_s(w_i)$, we need one bit to represent $\mathrm{sign}(w_i)$ and $\lceil \log_2(s+1) \rceil$ bits to represent $\zeta_i(\mathbf{w}, s)$. The scalar $\|\mathbf{w}\|_2$ is usually represented with full precision, which we assume to be 32 bits. Thus, the number of bits communicated by a client to the central server per round, which we denote by $C_s$, is given by
$$C_s = d \lceil \log_2(s+1) \rceil + d + 32. \qquad (6)$$
It can be shown from the work of [7, 18] that while $Q_s(\mathbf{w})$ remains unbiased for all $s$, i.e., $\mathbb{E}[Q_s(\mathbf{w}) \mid \mathbf{w}] = \mathbf{w}$, the variance of $Q_s(\mathbf{w})$ decreases with $s$ because of the following variance upper bound:
$$\mathbb{E}[\|Q_s(\mathbf{w}) - \mathbf{w}\|_2^2 \mid \mathbf{w}] \leq \frac{d}{s^2} \|\mathbf{w}\|_2^2. \qquad (7)$$
From (6) and (7), we see that varying $s$ results in a trade-off between the total number of bits communicated $C_s$ and the variance upper bound: $C_s$ increases with $s$ while the variance upper bound in (7) decreases with $s$. Building on this observation, in the next section, we analyze the effect of $s$ on the error convergence speed and use it to design a strategy to adapt $s$ during the course of training.
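The quantizer in (4)-(5) and the per-round bit count in (6) can be sketched in NumPy as follows; this is an illustrative implementation (not the authors' code), with the sign bit and the 32-bit norm accounted for only in the bit count.

```python
import numpy as np

def uniform_quantize(w, s, rng=None):
    """Stochastic uniform quantizer Q_s(.) of (4)-(5), applied elementwise to w."""
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(w)
    if norm == 0.0:
        return np.zeros_like(w)              # Q_s(0) = 0 by definition
    ratio = np.abs(w) / norm                 # |w_i| / ||w||_2, lies in [0, 1]
    l = np.floor(ratio * s)                  # level index l with ratio in [l/s, (l+1)/s)
    prob_up = ratio * s - l                  # probability of rounding up to (l+1)/s
    zeta = (l + (rng.random(w.shape) < prob_up)) / s
    return norm * np.sign(w) * zeta

def bits_per_round(d, s):
    """Per-client bits per round C_s from (6): level bits, sign bits, 32-bit norm."""
    return d * int(np.ceil(np.log2(s + 1))) + d + 32
```

The quantizer is unbiased because the expected value of `zeta` for each coordinate equals $|w_i|/\|\mathbf{w}\|_2$, which is the property used throughout the analysis below.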
3. TRADE-OFF BETWEEN ERROR AND THE NUMBER OF BITS COMMUNICATED
The motivation behind adapting the number of quantization levels $s$ during training can be understood through the illustration in Fig. 1. In the left plot, we see that a smaller $s$, that is, coarser quantization, results in worse convergence of training loss versus the number of training rounds. However, a smaller $s$ reduces the number of bits $C_s$ communicated per round. To account for this communication reduction, we change the x-axis to the number of bits communicated in the right plot of Fig. 1. This plot reveals that a smaller $s$ enables us to perform more rounds for the same number of bits communicated, leading to a faster initial drop in training loss. The intuition behind our adaptive algorithm is to start with a small $s$ and then gradually increase $s$ as training progresses to reach a lower error floor. To formalize this, we provide below a convergence bound on the training loss versus the number of bits communicated for any given $s$.

Convergence Bound in terms of Error versus Number of Bits Communicated.
For a non-convex objective function f ( w ) , it is common to look at the expected squared normof the gradient of the objective function as the error metricwe want to bound [19]. We analyze this quantity under thefollowing standard assumptions. Assumption 1.
The stochastic quantization operator $Q(\cdot)$ is unbiased and its variance is at most some positive constant $q$ times the squared $\ell_2$ norm of its argument, i.e., $\forall \mathbf{w} \in \mathbb{R}^d$, $\mathbb{E}[Q(\mathbf{w}) \mid \mathbf{w}] = \mathbf{w}$ and $\mathbb{E}[\|Q(\mathbf{w}) - \mathbf{w}\|_2^2 \mid \mathbf{w}] \leq q \|\mathbf{w}\|_2^2$.

Assumption 2. The local objective functions $f_i$ are $L$-smooth, i.e., $\forall \mathbf{w}, \mathbf{w}' \in \mathbb{R}^d$, $\|\nabla f_i(\mathbf{w}) - \nabla f_i(\mathbf{w}')\|_2 \leq L \|\mathbf{w} - \mathbf{w}'\|_2$.

Assumption 3.
The stochastic gradients computed at the clients are unbiased and their variance is bounded, that is, for all $\mathbf{w} \in \mathbb{R}^d$, $\mathbb{E}[g_i(\mathbf{w}, \xi^{(i)})] = \nabla f_i(\mathbf{w})$ and $\mathbb{E}[\|g_i(\mathbf{w}, \xi^{(i)}) - \nabla f_i(\mathbf{w})\|_2^2] \leq \sigma^2$.

Assumption 4.
Each client $i$ has a dataset $\mathcal{D}_i$ of $m$ samples drawn independently from the same distribution (i.i.d. data).

Under these assumptions, the authors in [17] recently derived a convergence bound for the FL setup described in Section 2 for non-convex $\ell(\cdot\,;\cdot)$. We use this result for AdaQuantFL; however, in practice our algorithm can also be successfully applied without Assumption 4 (non-i.i.d. data), as seen in our experiments in Section 5. Also, while the existing result [17] studies the error convergence with respect to the number of training rounds, we bound the same error in terms of the number of bits communicated, defined as follows.

Definition 1 (Number of Bits Communicated, $B$). The total number of bits that have been communicated by a client to the central server until a given time instant is denoted by $B$.

Since all clients participate in a training round and follow the same quantization protocol, $B$ is the same for all clients at any instant. We also note that the stochastic uniform quantizer with $s$ quantization levels satisfies Assumption 1 with $q = \frac{d}{s^2}$ [7, 18]. Now using this definition of $B$ and our earlier definition of $C_s$ in (6), we get the following theorem.

Theorem 1.
Under Assumptions 1-4, take $Q(\cdot)$ to be the stochastic uniform quantizer with $s$ quantization levels. If the learning rate satisfies $1 - \eta L \left(1 + \frac{d\tau}{s^2 n}\right) - \eta^2 L^2 \tau(\tau - 1) \geq 0$, then we have the following error upper bound in terms of $B$:
$$\frac{C_s}{B\tau} \sum_{k=0}^{(B/C_s)-1} \sum_{t=0}^{\tau-1} \mathbb{E}\big[\|\nabla f(\bar{\mathbf{w}}_{k,t})\|_2^2\big] \leq A_1 \log_2(4s) + \frac{A_2}{s^2} + A_3. \qquad (8)$$
Here, $\bar{\mathbf{w}}_{k,t} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{w}_{k,t}^{(i)}$ denotes the averaged model across all clients at each step, and
$$A_1 = \frac{2(f(\mathbf{w}_0) - f^*)d}{\eta B \tau}, \quad A_2 = \frac{\eta L d \sigma^2}{n}, \quad A_3 = \frac{\eta^2 \sigma^2 (\tau-1) L^2 (n+1)}{n} + \frac{\eta L \sigma^2}{n} + \frac{A_1 (d+32)}{d}, \qquad (9)$$
where $\mathbf{w}_0$ is a random point of initialization and $f^*$ is the minimum value of our objective.

The proof of Theorem 1 is deferred to Appendix A. This error bound allows us to see, for different values of $s$, the trade-off between coarse and fine quantization discussed above. As we decrease $s$, the value of the first term in our error bound ($A_1 \log_2(4s)$) decreases, but coarser quantization also adds to the variance of our quantized updates, which increases the second term ($A_2/s^2$).
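As a sketch of this trade-off, the right-hand side of (8) can be evaluated numerically over a range of $s$, using the constants in (9) as reconstructed above. All argument values would have to be supplied or estimated, and the helper name is illustrative.

```python
import numpy as np

def error_bound_rhs(s, f0_minus_fstar, eta, L, sigma2, n, d, tau, B):
    """Right-hand side of (8): A1*log2(4s) + A2/s^2 + A3, with A1-A3 from (9)."""
    A1 = 2.0 * f0_minus_fstar * d / (eta * B * tau)
    A2 = eta * L * d * sigma2 / n
    A3 = (eta**2 * sigma2 * (tau - 1) * L**2 * (n + 1) / n
          + eta * L * sigma2 / n
          + A1 * (d + 32) / d)
    return A1 * np.log2(4 * s) + A2 / s**2 + A3

# The first term grows with s while the second shrinks, so the bound is
# minimized at an intermediate s, e.g.
#   s_grid = np.arange(1, 65)
#   s_best = s_grid[np.argmin([error_bound_rhs(s, ...) for s in s_grid])]
```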
4. PROPOSED ADAQUANTFL STRATEGY
Our proposed algorithm aims at adaptively changing the number of quantization levels $s$ in the stochastic uniform quantizer such that the error upper bound in Theorem 1 is minimized at every value of $B$. To do so, we discretize the entire training process into uniform communication intervals, where in each interval we communicate $B_0$ bits (see Fig. 1). We now discuss how to find the optimal $s$ for each such interval.

Finding the optimal $s$ for each communication interval. We propose selecting an $s$ at any $B_0$ (assuming $\mathbf{w}_0$ as the point of initialization) by setting the derivative of our error upper bound in (8) to zero. Doing so, we get a closed form solution for an optimal $s$ as:
$$s_0^* = \sqrt{\frac{\eta^2 L \sigma^2 \tau B_0 \log_e(2)}{n\,(f(\mathbf{w}_0) - f^*)}}. \qquad (10)$$
Now at the beginning of the $k$-th communication interval, clients can be viewed as restarting training at a new initialization point $\mathbf{w}_0 = \mathbf{w}_k$. Using (10), we see that the optimal $s$ for communicating the next $B_0$ bits is given by
$$s_k^* = \sqrt{\frac{\eta^2 L \sigma^2 \tau B_0 \log_e(2)}{n\,(f(\mathbf{w}_k) - f^*)}}. \qquad (11)$$
As $f(\mathbf{w}_k)$ becomes smaller, the value of $s_k^*$ increases, which supports our intuition that we should increase $s$ as training progresses. However, in practice, parameters such as $L$, $\sigma^2$ and $f^*$ are unknown. Hence, in order to obtain a practically usable schedule for $s_k^*$, we assume $f^* = 0$ and divide $s_k^*$ by $s_0^*$ to get the approximate adaptive rule:
$$s_k^* \approx \sqrt{\frac{f(\mathbf{w}_0)}{f(\mathbf{w}_k)}}\; s_0^*. \qquad (12)$$
The value of $s_0^*$ can be found via grid search (we found $s_0^* = 2$ to be a good choice in our experiments). Variable Learning Rate.
Our analysis so far assumed the existence of a fixed learning rate $\eta$. In practice, we may want to decrease the learning rate as training progresses for better convergence. By extending the above analysis, we get an adaptive schedule of $s$ for a given learning rate schedule:
$$\text{AdaQuantFL:} \quad s_k^* \approx \sqrt{\frac{\eta_k\, f(\mathbf{w}_0)}{\eta_0\, f(\mathbf{w}_k)}}\; s_0^*. \qquad (13)$$
Here, $\eta_0$ is the initial learning rate and $\eta_k$ is the learning rate in the $k$-th interval. In terms of the number of bits used to represent each element in the model update, in the $k$-th interval, AdaQuantFL uses $b_k^* = \lceil \log_2(s_k^* + 1) \rceil$ bits (excluding the sign bit).
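Putting (12)-(13) together, the AdaQuantFL schedule for a communication interval can be sketched as below; rounding $s_k^*$ to a positive integer is our own assumption (the analysis treats $s$ as an integer number of levels), and $s_0^*$ defaults to the grid-searched value of 2 mentioned above.

```python
import numpy as np

def adaquantfl_level(f_w0, f_wk, eta0, eta_k, s0_star=2):
    """Quantization level and per-element bit-width for the k-th interval.

    Implements the rule (13); with eta_k == eta0 it reduces to (12).
    f_w0, f_wk: training loss at initialization and at the start of the
    k-th interval; eta0, eta_k: initial and current learning rates.
    """
    s_k = int(round(np.sqrt((eta_k * f_w0) / (eta0 * f_wk)) * s0_star))
    s_k = max(s_k, 1)                         # at least one quantization level
    b_k = int(np.ceil(np.log2(s_k + 1)))      # b_k^* bits, excluding the sign bit
    return s_k, b_k
```

For instance, with a fixed learning rate and the training loss halved relative to $f(\mathbf{w}_0)$, the rule gives $s_k^* \approx \sqrt{2}\, s_0^* \approx 2.83$, which this sketch rounds to 3 levels ($b_k^* = 2$ bits).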
Fig. 2: Training loss vs. the number of bits communicated (Gb), and the corresponding $b_k^*$, for ResNet-18: (a) fixed LR, i.i.d. data; (b) variable LR, i.i.d. data; (c) fixed LR, non-i.i.d. data. AdaQuantFL is compared against 2-, 4-, 8- and 16-bit fixed quantization. AdaQuantFL requires fewer bits to reach a lower loss threshold; in (a), AdaQuantFL reaches a loss of 0.02 in 0.3 Gb while the 2-bit method takes 1.8 Gb. Here $b_k^* = \lceil \log_2(s_k^* + 1) \rceil$ (defined in Section 4).
Fig. 3: Training loss vs. the number of bits communicated (Mb), and the corresponding $b_k^*$, for the Vanilla CNN: (a) fixed LR, i.i.d. data; (b) variable LR, i.i.d. data; (c) fixed LR, non-i.i.d. data. For the Vanilla CNN, AdaQuantFL is able to achieve the lowest error floor of 0.02 for the non-i.i.d. data distribution, while other methods converge at a higher error floor. Here $b_k^* = \lceil \log_2(s_k^* + 1) \rceil$ (defined in Section 4).
5. EXPERIMENTAL RESULTS
We evaluate the performance of AdaQuantFL against fixed quantization schemes that use $b = \{2, 4, 8, 16\}$ bits, respectively, to represent each element of the model update (excluding the sign bit) using the stochastic uniform quantizer. The performance is measured on classification of the CIFAR-10 [20] and Fashion-MNIST [21] datasets using ResNet-18 [22] and a Vanilla CNN architecture [1] (referred to as CNN from here on), respectively. For all our experiments, we set the number of local updates to $\tau = 10$ with a fixed initial learning rate $\eta_0$, and train over multiple clients for the ResNet-18 and for the CNN. For the variable learning rate setting, we reduce the learning rate by a constant factor at regular intervals of training rounds. We run our experiments on both i.i.d. and non-i.i.d. distributions of data over clients. Our experimental results verify that AdaQuantFL is able to reach an error floor using far fewer bits in most cases, as seen in Fig. 2 and Fig. 3. Additional details and figures, including test accuracy plots, can be found in Appendix D.
6. CONCLUSION
In this paper we present AdaQuantFL, a strategy to adapt the number of quantization levels used to represent compressed model updates in federated learning. AdaQuantFL is based on a rigorous error versus bits convergence analysis. Our experiments show that AdaQuantFL requires fewer bits to converge during training. A natural extension of AdaQuantFL would be to use other quantizers such as the stochastic rotated quantizer [18] and the universal vector quantizer [9].

7. REFERENCES

[1] H. Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Agüera y Arcas, "Communication-efficient learning of deep networks from decentralized data," in International Conference on Artificial Intelligence and Statistics (AISTATS), Apr. 2017.
[2] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, et al., "Advances and open problems in federated learning," arXiv preprint arXiv:1912.04977, 2019.
[3] Jianyu Wang and Gauri Joshi, "Cooperative SGD: Unifying temporal and spatial strategies for communication-efficient distributed SGD," preprint, Aug. 2018.
[4] Jianyu Wang, Hao Liang, and Gauri Joshi, "Overlap local-SGD: An algorithmic approach to hide communication delays in distributed SGD," arXiv preprint arXiv:2002.09539, 2020.
[5] Farzin Haddadpour, Mohammad Mahdi Kamani, Mehrdad Mahdavi, and Viveck Cadambe, "Local SGD with periodic averaging: Tighter analysis and adaptive synchronization," in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., pp. 11082–11094. Curran Associates, Inc., 2019.
[6] Sebastian U. Stich, "Local SGD converges fast and communicates little," arXiv preprint arXiv:1805.09767, 2018.
[7] Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic, "QSGD: Communication-efficient SGD via gradient quantization and encoding," in Advances in Neural Information Processing Systems, 2017, pp. 1709–1720.
[8] Jakub Konečný, H. Brendan McMahan, Daniel Ramage, and Peter Richtárik, "Federated optimization: Distributed machine learning for on-device intelligence," arXiv preprint arXiv:1610.02527, 2016.
[9] Nir Shlezinger, Mingzhe Chen, Yonina C. Eldar, H. Vincent Poor, and Shuguang Cui, "UVeQFed: Universal vector quantization for federated learning," arXiv preprint arXiv:2006.03262, 2020.
[10] Wei Wen, Cong Xu, Feng Yan, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li, "TernGrad: Ternary gradients to reduce communication in distributed deep learning," arXiv preprint arXiv:1705.07878, May 2017.
[11] Venkata Gandikota, Daniel Kane, Raj Kumar Maity, and Arya Mazumdar, "vqSGD: Vector quantized stochastic gradient descent," 2019.
[12] Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and Stephen Wright, "ATOMO: Communication-efficient learning via atomic sparsification," in Advances in Neural Information Processing Systems, 2018, pp. 9850–9861.
[13] Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi, "PowerSGD: Practical low-rank gradient compression for distributed optimization," in Advances in Neural Information Processing Systems, 2019, pp. 14259–14268.
[14] Fartash Faghri, Iman Tabrizian, Ilia Markov, Dan Alistarh, Daniel Roy, and Ali Ramezani-Kebrya, "Adaptive gradient quantization for data-parallel SGD," arXiv preprint arXiv:2010.12460, 2020.
[15] Jun Sun, Tianyi Chen, Georgios B. Giannakis, and Zaiyue Yang, "Communication-efficient distributed learning via lazily aggregated quantized gradients," arXiv preprint arXiv:1909.07588, 2019.
[16] Jianyu Wang and Gauri Joshi, "Adaptive communication strategies for best error-runtime trade-offs in communication-efficient distributed SGD," in Proceedings of the SysML Conference, Apr. 2019.
[17] Amirhossein Reisizadeh, Aryan Mokhtari, Hamed Hassani, Ali Jadbabaie, and Ramtin Pedarsani, "FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization," in International Conference on Artificial Intelligence and Statistics, 2020, pp. 2021–2031.
[18] Ananda Theertha Suresh, X. Yu Felix, Sanjiv Kumar, and H. Brendan McMahan, "Distributed mean estimation with limited communication," in International Conference on Machine Learning, 2017, pp. 3329–3337.
[19] Léon Bottou, Frank E. Curtis, and Jorge Nocedal, "Optimization methods for large-scale machine learning," arXiv preprint arXiv:1606.04838, Feb. 2018.
[20] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton, "CIFAR-10 (Canadian Institute for Advanced Research)."
[21] Han Xiao, Kashif Rasul, and Roland Vollgraf, "Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms," arXiv preprint arXiv:1708.07747, 2017.
[22] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[23] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds., pp. 8024–8035. Curran Associates, Inc., 2019.
APPENDIX
A. PROOF OF THEOREM 1
We first adapt the following result from [17], which states that under Assumptions 1-4, for sufficiently small $\eta$ such that
$$1 - \eta L \left(1 + \frac{q\tau}{n}\right) - \eta^2 L^2 \tau(\tau-1) \geq 0,$$
we have after $K$ rounds of training
$$\frac{1}{K\tau} \sum_{k=0}^{K-1} \sum_{t=0}^{\tau-1} \mathbb{E}\big[\|\nabla f(\bar{\mathbf{w}}_{k,t})\|_2^2\big] \leq \frac{2(f(\mathbf{w}_0) - f^*)}{\eta K \tau} + \frac{\eta L (1+q)\sigma^2}{n} + \frac{\eta^2 \sigma^2 (n+1)(\tau-1) L^2}{n},$$
where $\bar{\mathbf{w}}_{k,t} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{w}_{k,t}^{(i)}$ denotes the averaged model across all clients at each step. We note that the above result holds for any stochastic quantization operator $Q(\cdot)$ that satisfies Assumption 1 with arbitrary $q$.

We assume that the stochastic uniform quantizer $Q_s(\mathbf{w})$ satisfies Assumption 1 with $q = q_s$. Now by our definition of $B$, we can write $K = \frac{B}{C_s}$ (assuming $B \bmod C_s = 0$). Doing so, we get
$$\frac{C_s}{B\tau} \sum_{k=0}^{(B/C_s)-1} \sum_{t=0}^{\tau-1} \mathbb{E}\big[\|\nabla f(\bar{\mathbf{w}}_{k,t})\|_2^2\big] \leq \frac{2(f(\mathbf{w}_0) - f^*) C_s}{\eta B \tau} + \frac{\eta L (1+q_s)\sigma^2}{n} + \frac{\eta^2 \sigma^2 (n+1)(\tau-1) L^2}{n}.$$
Now substituting $C_s = d\lceil\log_2(s+1)\rceil + d + 32$ (using (6)) and $q_s = \frac{d}{s^2}$ (using (7)) in the RHS of the last inequality, we get
$$\frac{C_s}{B\tau} \sum_{k=0}^{(B/C_s)-1} \sum_{t=0}^{\tau-1} \mathbb{E}\big[\|\nabla f(\bar{\mathbf{w}}_{k,t})\|_2^2\big] \leq A_1 \lceil\log_2(s+1)\rceil + \frac{A_2}{s^2} + A_3 \leq A_1 \log_2(4s) + \frac{A_2}{s^2} + A_3,$$
where the last inequality follows from the fact that for $s \geq 1$ we have $\lceil\log_2(s+1)\rceil \leq \log_2(4s)$. The constants $A_1$, $A_2$ and $A_3$ are defined as follows:
$$A_1 = \frac{2(f(\mathbf{w}_0) - f^*)d}{\eta B \tau}, \quad A_2 = \frac{\eta L d \sigma^2}{n}, \quad A_3 = \frac{\eta^2 \sigma^2 (\tau-1) L^2 (n+1)}{n} + \frac{\eta L \sigma^2}{n} + \frac{A_1(d+32)}{d}. \qquad (14)$$
This completes the proof of Theorem 1.

B. PROOF OF EQN. (10)
Let $F(s)$ be the objective which we want to minimize. We have
$$F(s) = A_1 \log_2(4s) + \frac{A_2}{s^2} + A_3.$$
Taking the first derivative, we have
$$\nabla F(s) = \frac{\hat{A}_1}{s} - \frac{2A_2}{s^3}, \quad \text{where } \hat{A}_1 = A_1 \log_2(e).$$
Upon setting $\nabla F(s) = 0$ we get $s = \sqrt{2A_2/\hat{A}_1}$ as the positive solution. We see that for $s \in (0, \sqrt{2A_2/\hat{A}_1})$, $F(s)$ is decreasing since $\nabla F(s) < 0$, and for $s \in (\sqrt{2A_2/\hat{A}_1}, \infty)$, $F(s)$ is increasing since $\nabla F(s) > 0$. This implies that $F(s)$ attains its global minimum at $s = \sqrt{2A_2/\hat{A}_1}$. Substituting back the values of $\hat{A}_1$ and $A_2$ (with $B = B_0$, the number of bits communicated in one interval), we get
$$s_0^* = \sqrt{\frac{\eta^2 L \sigma^2 B_0 \tau \log_e(2)}{n\,(f(\mathbf{w}_0) - f^*)}}.$$
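The stationary-point computation above can also be checked symbolically; the following SymPy sketch minimizes the $s$-dependent part of the reconstructed bound and is only a sanity check of the algebra, not part of the paper.

```python
import sympy as sp

s, A1, A2 = sp.symbols('s A_1 A_2', positive=True)
F = A1 * sp.log(4 * s, 2) + A2 / s**2       # s-dependent part of F(s)
stationary = sp.solve(sp.diff(F, s), s)     # A1/(s*ln 2) - 2*A2/s**3 = 0
print(stationary)                           # one positive root: s = sqrt(2*A_2*log(2)/A_1)
print(sp.simplify(sp.diff(F, s, 2).subs(s, stationary[0])))  # positive, so s is a minimizer
```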
C. CONVERGENCE GUARANTEE FOR ADAQUANTFL

We now provide a convergence guarantee for AdaQuantFL. In order to do so, we first state the following theorem.

Theorem 2 (Adaptive Quantization and Variable Learning Rate Error Bound). Assuming $K$ to be the total number of training rounds and $\eta_k$, $s_k$ to be the values of the learning rate and the quantization level in the $k$-th training round respectively, if the following condition is satisfied,
$$\forall k \in \{0, \ldots, K-1\}: \quad 1 - \eta_k L \left(1 + \frac{d\tau}{n s_k^2}\right) - \eta_k^2 L^2 \tau(\tau-1) \geq 0,$$
we have under Assumptions 1-4,
$$\mathbb{E}\left[\frac{\sum_{k=0}^{K-1} \eta_k \sum_{t=0}^{\tau-1} \|\nabla f(\bar{\mathbf{w}}_{k,t})\|_2^2}{\sum_{k=0}^{K-1} \eta_k}\right] \leq O\!\left(\frac{1}{\sum_{k=0}^{K-1}\eta_k}\right) + O\!\left(\frac{\sum_{k=0}^{K-1}\eta_k^2}{\sum_{k=0}^{K-1}\eta_k}\right) + O\!\left(\frac{\sum_{k=0}^{K-1}\eta_k^3}{\sum_{k=0}^{K-1}\eta_k}\right) + O\!\left(\frac{\sum_{k=0}^{K-1}\eta_k^2 (d/s_k^2)}{\sum_{k=0}^{K-1}\eta_k}\right). \qquad (15)$$

Proof:
We note here that the subscript $k$ refers to the index of the communication round, in contrast to Section 4 where it referred to the index of the communication interval. We also note that for the $k$-th training round, the stochastic uniform quantizer with $s_k$ levels satisfies Assumption 1 with $q = \frac{d}{s_k^2}$.

We now use the following result from [17] (modified for the stochastic uniform quantizer), which states that under Assumptions 1-4, for the $k$-th training round, if we have
$$1 - \eta_k L \left(1 + \frac{d\tau}{s_k^2 n}\right) - \eta_k^2 L^2 \tau(\tau-1) \geq 0, \qquad (16)$$
then
$$\mathbb{E}[f(\mathbf{w}_{k+1})] \leq \mathbb{E}[f(\mathbf{w}_k)] - \frac{\eta_k}{2} \sum_{t=0}^{\tau-1} \mathbb{E}\big[\|\nabla f(\bar{\mathbf{w}}_{k,t})\|_2^2\big] + \frac{\eta_k^2 L \tau \sigma^2}{2n} + \frac{\eta_k^2 (d/s_k^2) L \tau \sigma^2}{2n} + \frac{\eta_k^3 \sigma^2 (n+1)\tau(\tau-1) L^2}{2n}. \qquad (17)$$
We now assume (16) holds for all $k \in \{0, \ldots, K-1\}$. Summing over all rounds $k \in \{0, \ldots, K-1\}$ and after minor rearranging of terms, we get
$$\mathbb{E}\left[\sum_{k=0}^{K-1} \eta_k \sum_{t=0}^{\tau-1} \|\nabla f(\bar{\mathbf{w}}_{k,t})\|_2^2\right] \leq 2(f(\mathbf{w}_0) - f^*) + \frac{L\tau\sigma^2 \sum_{k=0}^{K-1}\eta_k^2}{n} + \frac{L\tau\sigma^2 \sum_{k=0}^{K-1}\eta_k^2 (d/s_k^2)}{n} + \frac{\sigma^2(n+1)\tau(\tau-1)L^2 \sum_{k=0}^{K-1}\eta_k^3}{n}. \qquad (18)$$
Dividing both sides by $\sum_{k=0}^{K-1}\eta_k$, we have
$$\mathbb{E}\left[\frac{\sum_{k=0}^{K-1} \eta_k \sum_{t=0}^{\tau-1} \|\nabla f(\bar{\mathbf{w}}_{k,t})\|_2^2}{\sum_{k=0}^{K-1} \eta_k}\right] \leq \frac{2(f(\mathbf{w}_0) - f^*)}{\sum_{k=0}^{K-1}\eta_k} + \frac{L\tau\sigma^2 \sum_{k=0}^{K-1}\eta_k^2}{n\sum_{k=0}^{K-1}\eta_k} + \frac{\sigma^2(n+1)\tau(\tau-1)L^2 \sum_{k=0}^{K-1}\eta_k^3}{n\sum_{k=0}^{K-1}\eta_k} + \frac{L\tau\sigma^2 \sum_{k=0}^{K-1}\eta_k^2 (d/s_k^2)}{n\sum_{k=0}^{K-1}\eta_k} \qquad (19)$$
$$= O\!\left(\frac{1}{\sum_{k=0}^{K-1}\eta_k}\right) + O\!\left(\frac{\sum_{k=0}^{K-1}\eta_k^2}{\sum_{k=0}^{K-1}\eta_k}\right) + O\!\left(\frac{\sum_{k=0}^{K-1}\eta_k^3}{\sum_{k=0}^{K-1}\eta_k}\right) + O\!\left(\frac{\sum_{k=0}^{K-1}\eta_k^2 (d/s_k^2)}{\sum_{k=0}^{K-1}\eta_k}\right). \qquad (20)$$
This completes the proof of Theorem 2.

C.1. Proof of Convergence

We assume the following conditions hold true:
$$\lim_{K\to\infty} \sum_{k=0}^{K-1}\eta_k \to \infty, \quad \lim_{K\to\infty} \sum_{k=0}^{K-1}\eta_k^2 < \infty, \quad \lim_{K\to\infty} \sum_{k=0}^{K-1}\eta_k^3 < \infty. \qquad (21)$$
Now a sufficient condition for the upper bound in (15) to converge to zero as $K\to\infty$ is
$$\lim_{K\to\infty} \sum_{k=0}^{K-1}\eta_k^2 (d/s_k^2) < \infty. \qquad (22)$$
Since the number of quantization levels $s_k$ will be greater than or equal to 1 for any training round, we have
$$\lim_{K\to\infty} \sum_{k=0}^{K-1}\eta_k^2 (d/s_k^2) \leq d \lim_{K\to\infty} \sum_{k=0}^{K-1}\eta_k^2 < \infty. \qquad (23)$$
This implies that as $K\to\infty$ we have
$$\mathbb{E}\left[\frac{\sum_{k=0}^{K-1} \eta_k \sum_{t=0}^{\tau-1} \|\nabla f(\bar{\mathbf{w}}_{k,t})\|_2^2}{\sum_{k=0}^{K-1} \eta_k}\right] \to 0. \qquad (24)$$
This completes the proof of convergence.

D. ADDITIONAL RESULTS
In this section, we provide further details of our experiments and some additional results. Figures 4 and 5 show the test accuracies for the experiments on the ResNet-18 and the CNN, trained on CIFAR-10 and Fashion-MNIST respectively. AdaQuantFL is able to achieve a test accuracy of 69.12% for the ResNet-18 experiment shown in Fig. 4(a), whereas the 16-bit quantization method achieves 69.52%. For the CNN experiment shown in Fig. 5(a), AdaQuantFL reaches a test accuracy of 91.15%, while the 16-bit method reaches 91.01%.
Bits Communicated (Gb) b *k (a) fixed LR, i.i.d data T e s t A cc u r ac y AdaQuantFL2 bit4 bit8 bit16 bit
Bits Communicated (Gb) b *k (b) variable LR, i.i.d data T e s t A cc u r ac y AdaQuantFL2 bit4 bit8 bit16 bit
Bits Communicated (Gb) b *k (c) fixed LR, non-i.i.d data Fig. 4 : Test Accuracy vs the number of bits communicated for ResNet-18 on CIFAR-10For the non i.i.d settings, each dataset was sorted according to the target class labels and then partitioned equally among clients.In all experiments we fix B = 16 d where d is the dimension of our parameter vector. The CNN architecture is inspiredfrom [1], and consists of 2 convolutional layers with 32 and 64 channels, each followed by a max-pool and ReLU layer. Theconvolutional layers are followed by a linear layer of 512 with a ReLU activation and then the output softmax layer. Allexperiments were implemented in PyTorch [23] with a ‘gloo’ distributed backend on a NVIDIA TitanX GPU.
Fig. 5: Test accuracy vs. the number of bits communicated (Mb), and the corresponding $b_k^*$, for the Vanilla CNN: (a) fixed LR, i.i.d. data; (b) variable LR, i.i.d. data; (c) fixed LR, non-i.i.d. data.

We observe that in the case of a variable learning rate, AdaQuantFL does well for the ResNet-18 experiment shown in Fig. 2(b) but cannot do better than the 4-bit setting for the CNN experiment shown in Fig. 3(b). As observed from (13), a decreasing learning rate schedule tries to reduce $s_k^*$, counteracting the increase in $s_k^*$ that comes from the decreasing training loss.