Ternary Compression for Communication-Efficient Federated Learning
Jinjin Xu, Wenli Du, Ran Cheng, Wangli He, Senior Member, IEEE, and Yaochu Jin, Fellow, IEEE
Abstract—Learning over massive data stored in different locations is essential in many real-world applications. However, sharing data is full of challenges due to the increasing demands of privacy and security with the growing use of smart mobile devices and IoT devices. Federated learning provides a potential solution to privacy-preserving and secure machine learning, by means of jointly training a global model without uploading data distributed on multiple devices to a central server. However, most existing work on federated learning adopts machine learning models with full-precision weights, and almost all these models contain a large number of redundant parameters that do not need to be transmitted to the server, consuming excessive communication costs. To address this issue, we propose a federated trained ternary quantization (FTTQ) algorithm, which optimizes the quantized networks on the clients through a self-learning quantization factor. A convergence proof of the quantization factor and the unbiasedness of FTTQ is given. In addition, we propose a ternary federated averaging protocol (T-FedAvg) to reduce the upstream and downstream communication of federated learning systems. Empirical experiments are conducted to train widely used deep learning models on publicly available datasets, and our results demonstrate the effectiveness of FTTQ and T-FedAvg compared with the canonical federated learning algorithms in reducing communication costs and maintaining the learning performance.
Index Terms—Deep learning, federated learning, communication efficiency, ternary coding.
I. INTRODUCTION

The number of Internet of Things (IoT) and smart mobile devices deployed, e.g., in the process industry has grown dramatically over the past decades, generating massive amounts of data stored distributively every moment. Meanwhile, recent achievements in deep learning [1], such
as AlphaGo [2], rely heavily on the knowledge stored in big data. Naturally, adopting deep learning methods for effective utilization of the rich data contained in local clients of the process industry, e.g., branch factories, will provide strong support to industrial production.

Manuscript received xx, 2020; revised xxx, 2020. This work was supported by the National Natural Science Foundation of China (Basic Science Center Program: 61988101), the International (Regional) Cooperation and Exchange Project (1720106008), the National Natural Science Foundation of China (Major Program: 61590923), the National Natural Science Fund for Distinguished Young Scholars (61725301), the National Natural Science Foundation of China (Major Program: 61590922) and the China Scholarship Council (201906745025). (Corresponding authors: Wenli Du; Yaochu Jin.)
Jinjin Xu, Wenli Du and Wangli He are with the Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education, East China University of Science and Technology, Shanghai, 200237, China, and also with the Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai, 200092, China. E-mail: [email protected], [email protected], [email protected].
Ran Cheng is with the Shenzhen Key Laboratory of Computational Intelligence, University Key Laboratory of Evolving Intelligent Systems of Guangdong Province, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China. E-mail: [email protected].
Yaochu Jin is with the Department of Computer Science, University of Surrey, Guildford, GU2 7XH, UK. E-mail: [email protected].
However, training deep learning models with distributed data is difficult, while uploading private data to the cloud is controversial, due to limitations on network bandwidth, budgets and security regulations, e.g., GDPR [3].

Many research efforts have been devoted to related fields recently, while early work in this area mainly focused on training deep models on multiple machines to alleviate the computational burden of large data volumes, known as distributed machine learning [4]. These methods have achieved satisfactory performance by splitting big data into small sets to accelerate the model training process. For example, data parallelism [4], model parallelism [4], [5] (see Fig. 1) and parameter servers [6]–[8] are commonly used methods in practice. Correspondingly, weight optimization strategies for multiple machines have also been proposed. For example, Zhang et al. [9] proposed an asynchronous mini-batch stochastic gradient descent algorithm (ASGD) on multi-GPU devices for training deep neural networks (DNNs) and achieved a 3.2 times speed-up on 4 GPUs over a single GPU without loss of precision. Recently, distributed machine learning algorithms for multiple datacenters located in different regions have been studied in [10]. However, little attention has been paid to data security and the impact of data distribution on performance.

To address the drawbacks of distributed learning, researchers have proposed an interesting framework to train a global model while keeping the private data local, known as federated learning [11]–[13]. The federated approach makes it possible to extract knowledge from the data distributed on local devices without uploading private data to a certain server. Fig. 2 illustrates the simplified workflow and an application scenario in the process industry. Several extensions have been introduced to the standard federated learning system. Zhao et al.
[14] have observed weight divergence caused by extreme data distributions and proposed sharing a small amount of data with other clients to enhance the performance of federated learning algorithms. Wang et al. [15] have proposed adaptive federated learning systems under a given resource budget based on a control algorithm that balances client updates and global aggregation, and analyzed the convergence bound of the distributed gradient descent. Recent comprehensive overviews of federated learning can be found in [16], [17], and design ideas, challenges and future research directions of federated learning on massive mobile devices are presented in [18], [19].

Since users may pay more attention to privacy protection and data security, federated learning will play a key role in deep learning, although it faces challenges in terms of data distribution and communication costs.
1) Data Distribution:
The data generated by different clients, e.g., factories, may be unbalanced and not subject to the independent and identically distributed (IID) hypothesis, i.e., the datasets are unbalanced and/or non-IID.
2) Communication Costs:
Federated learning is influenced by the rapidly growing depth of models and the number of multiply-accumulate operations (MACs) [20]. This is due to the fact that massive communication costs for uploading and downloading are necessary, while the average upload and download speeds are asymmetric, e.g., a mean mobile download speed of 26.36 Mbps vs. an upload speed of 11.05 Mbps in the UK in Q3-Q4 2017 [21].
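To make this asymmetry concrete, the following sketch estimates per-round transfer times for a full-precision model at the quoted UK speeds; the 25 MB model size is a hypothetical figure chosen for illustration only:

```python
# Estimate upload/download time per communication round for one client.
# Speeds are the UK Q3-Q4 2017 averages quoted above; the 25 MB model
# size is an assumed value for illustration.

MODEL_MB = 25.0
UPLOAD_MBPS = 11.05      # megabits per second
DOWNLOAD_MBPS = 26.36

def transfer_seconds(size_mb: float, speed_mbps: float) -> float:
    """Time to move size_mb megabytes at speed_mbps megabits/second."""
    return size_mb * 8.0 / speed_mbps

up = transfer_seconds(MODEL_MB, UPLOAD_MBPS)
down = transfer_seconds(MODEL_MB, DOWNLOAD_MBPS)
print(f"upload: {up:.1f} s, download: {down:.1f} s")  # upload takes ~2.4x longer
```

Under these assumptions the upstream direction dominates each round, which motivates compressing the uploaded updates first.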
Fig. 1. The diagram of model parallelism. The complete model is distributed and stored in multiple clients, and the model is stitched after the training procedure is finished on the clients.
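As a minimal illustration of the model parallelism in Fig. 1, the sketch below splits the layers of a toy network across hypothetical clients, runs a forward pass by handing activations from one partition to the next, and "stitches" the partitions back into one model (pure Python, no real networking; the layer definitions are illustrative placeholders):

```python
# Toy model parallelism: each "client" holds a contiguous slice of layers.
# A forward pass passes activations from one client to the next; after
# training, the slices are concatenated ("stitched") into the full model.

def make_layer(scale, bias):
    """A trivial 'layer': elementwise affine transform."""
    return lambda xs: [scale * x + bias for x in xs]

full_model = [make_layer(2.0, 0.0), make_layer(1.0, 1.0),
              make_layer(0.5, 0.0), make_layer(1.0, -1.0)]

# Partition the four layers over two clients.
clients = [full_model[0:2], full_model[2:4]]

def forward(partitions, xs):
    for part in partitions:        # client boundary: activations are handed on
        for layer in part:
            xs = layer(xs)
    return xs

# "Stitching": concatenating the partitions recovers the complete model.
stitched = [layer for part in clients for layer in part]
print(forward(clients, [1.0, 2.0]))
```

The partitioned and the stitched model compute identical outputs, since partitioning only changes where each layer is stored, not the composition of functions.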
Obviously, high communication costs are one of the main obstacles to distributed and federated training. Although the initial model compression research was not intended to reduce communication costs, it has been a source of inspiration for communication-efficient distributed learning. Neural network pruning is an early model compression method proposed in [22]. Parameter pruning and sharing [22], [23], low-rank factorization [24], transferred/compact convolutional filters [25] and knowledge distillation [26] are some of the main ideas reported in the literature. Reduction of communication costs by simultaneously maximizing the performance and minimizing the model complexity using a multi-objective evolutionary algorithm is reported in [27]. Recently, a layer-wise asynchronous model update approach has been proposed in [28] to reduce the number of parameters to be transmitted.

Gradient quantization has been proposed to accelerate data-parallel distributed learning [29]; gradient sparsification [30] and gradient quantization [31], [32] have been developed to reduce the model size; an efficient federated learning scheme based on sparse ternary compression (STC) has been proposed [33], which is robust to non-IID data and communication-efficient on both upstream and downstream communications. However, since STC compresses the model after local training is completed, the quantization process is not optimized during training.

To the best of our knowledge, most federated learning methods employ full-precision networks or streamline the models after the training procedure on the client is finished, rather than simplifying the model during training. Therefore, deploying federated learning on the IoT devices widely used in the process industry remains difficult.
To address this issue, we focus on model compression on the clients during training to reduce the energy consumption at the inference stage and the communication costs of federated learning. The main contributions of this paper are summarized as follows:
Fig. 2. An illustration of the application of federated learning in the process industry. The branch factories train local models on private client data through iterations, and send the trained local models to the server for aggregation to obtain the optimized global model.

• A ternary quantization approach is introduced into the training and inference stages of clients. The trained ternary models are well suited for inference on network edge devices, e.g., wireless sensors.
• A ternary federated learning protocol is presented to reduce the communication costs between clients and server, which compresses both upstream and downstream communications. Note that the quantification of the model weights can further enhance privacy protection, since it makes reverse engineering of the model more difficult.
• A theoretical analysis of the proposed algorithm is provided, and the performance of the algorithm is empirically verified on a deep feedforward network (MLP) and a deep residual network (ResNet) using widely used datasets, namely MNIST [34] and CIFAR10 [35].

The remainder of this paper is organized as follows. In Section II, we briefly review the standard federated learning protocol and several widely used network quantization approaches. Section III proposes a method to quantify the models of the clients in federated learning systems, called federated trained ternary quantization (FTTQ), based on the quantization algorithms mentioned earlier. On the basis of FTTQ, a ternary federated learning protocol that reduces both upstream and downstream communications is presented. In Section IV, a
theoretical analysis of the proposed algorithm is provided. Experimental settings and results are presented in Section V to compare the new protocol with standard algorithms. Finally, conclusions and future directions are given in Section VI.

II. BACKGROUND AND METHODS
In this section, we first introduce some preliminaries of the standard federated learning workflow and its basic formulations. Subsequently, the definitions and main features of popular ternary quantization methods are presented, followed by a numeric example.
A. Federated Learning Protocol
It is usually assumed that the data used by a distributed learning algorithm belongs to the same feature space, which may not be true in federated learning. As illustrated in Fig. 3, the basic protocol of federated learning proposed in [18] is round-based. Specifically, private storage, the server and the clients (usually mobile devices) are the main participants in the whole protocol, and there are three main phases in each round: selection, configuration and reporting.
Fig. 3. Federated learning workflow with massive mobile devices. Firstly, the server selects suitable clients and deploys the configuration (global model structure) for training. Then the clients complete the learning process within the specified budget (e.g., time) and return the local models to the server. Finally, the server aggregates all local models to obtain the trained global model.
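The round-based workflow above, together with the dataset-size-weighted aggregation formalized in (2) below, can be sketched as follows (pure Python; the client sampling, noise-based "training" and dataset sizes are illustrative placeholders, not the actual training procedure):

```python
import random

random.seed(42)

def local_update(global_weights, client_data):
    """Placeholder for local training: returns (updated weights, |D_k|)."""
    # A real client would run several epochs of SGD on its private data here.
    return ([w + 0.1 * random.uniform(-1, 1) for w in global_weights],
            len(client_data))

def federated_round(global_weights, clients, participation=0.5):
    """One selection/configuration/reporting round with weighted averaging."""
    selected = random.sample(clients, max(1, int(participation * len(clients))))
    results = [local_update(global_weights, data) for data in selected]
    total = sum(n for _, n in results)
    # Weighted average: each client contributes in proportion to |D_k|.
    return [sum(w[i] * n for w, n in results) / total
            for i in range(len(global_weights))]

clients = [[0] * random.randint(50, 200) for _ in range(10)]  # fake datasets
weights = [0.0, 0.0, 0.0]
for _ in range(3):
    weights = federated_round(weights, clients)
```

Only model parameters cross the network in this loop; the fake datasets never leave their clients, which is the defining property of the protocol.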
In this work, we assume supervised learning is used to train the models. The global model with parameters θ, deployed on a distributed client k, is trained on the local dataset D_k, which consists of training sample pairs (x_i, y_i), i = 1, 2, ..., |D_k|. The loss of sample pair (x_i, y_i) is denoted by l(x_i, y_i; θ), where l is the loss function. The total loss of a certain task on client k is J_k:

$$J_k(\theta) = \frac{1}{|D_k|}\sum_{i=1}^{|D_k|} l(x_i, y_i; \theta). \quad (1)$$

We assume there are N clients whose data is stored independently, and the aim of federated learning is to minimize the global loss L. Therefore, the global objective function of the federated learning system can be defined as:

$$L(\theta) = \sum_{k=1}^{\lambda N} \frac{|D_k|}{\sum_{k=1}^{\lambda N} |D_k|}\, J_k(\theta), \quad (2)$$

where λ is the proportion of the clients participating in the aggregation in the current round, which is determined by the above three phases. Theoretically, the participation ratio λ in (2) is calculated as the number of participating clients divided by the total number of clients. Additionally, the local batch size B and the number of local epochs E are also important hyperparameters. In the experiments, we further study the effects of these parameters on the performance of the algorithm by manually setting their values.

It is easy to see that the communication costs depend heavily on the amount of information to be transferred between the server and the clients, and the dominating factor in this procedure is the size of the parameters. One of the requirements for communication-efficient federated learning is that both upstream and downstream communications need to be compressed [33]. Note that the performance of federated learning may drop dramatically due to disproportionation of the data distribution.

B. Quantization
Quantization improves the energy and space efficiency of deep networks by reducing the number of bits per weight [36]. This is done by mapping the parameters in a continuous space to a discrete quantization space, which can greatly reduce model redundancy and save memory overhead. It has been shown that the ternary weight network (TWN for short) [32] is able to reduce the Euclidean distance between the quantized parameters θ^t (consisting of -1, +1 and 0), scaled by a factor α, and the full-precision parameters θ compared with binary networks [31], thus making the accuracy of the quantized network close to that of the full-precision network:

$$\alpha^*, \theta^{t*} = \arg\min_{\alpha,\theta^t} F(\alpha, \theta^t) = \|\theta - \alpha\theta^t\|_2^2, \quad (3)$$

where F represents the cost function of this optimization problem; α* and θ^{t*} are the optimal solutions of F, with θ ≈ α*θ^{t*}. Li et al. [32] introduce an approximate optimal solution with a threshold-based function to quantify all layers of the deep neural network model:

$$\theta_l^t = \begin{cases} +1, & \theta_l > \Delta_l, \\ 0, & |\theta_l| \le \Delta_l, \\ -1, & \theta_l < -\Delta_l, \end{cases} \quad (4)$$

where θ_l and θ_l^t are the full-precision and quantized weights of the l-th layer, with θ = {θ_1, θ_2, ..., θ_l, ...} and θ^t = {θ_1^t, θ_2^t, ..., θ_l^t, ...}, respectively, which provides a rule of thumb to calculate the optimal Δ* = {Δ_1^*, Δ_2^*, ..., Δ_l^*, ...}.

However, the weights of TWN are limited to -1, 0, 1 and α* is a constant. In order to further improve the performance of quantized deep networks while maintaining the compression ratio, Zhu et al. [37] have proposed a trained ternary quantization algorithm (TTQ for short). In TTQ, two quantization factors (a positive factor w^p and a negative factor w^n) are adopted to scale the ternary weights in each layer.
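A minimal sketch of threshold-based ternary quantization in the style of (4), with per-layer positive/negative scaling factors in the spirit of TTQ, is given below (plain Python; the threshold rule, toy weights and the closed-form choice of the factors are assumptions for illustration — in real TTQ the two factors are learned by backpropagation):

```python
def ternarize(weights, delta):
    """Threshold function (4): map each weight to {-1, 0, +1}."""
    return [1 if w > delta else (-1 if w < -delta else 0) for w in weights]

def ttq_quantize(weights, t=0.05):
    """TTQ-style layer quantization with heuristic threshold t * max|w|.

    Here w_p and w_n are set to the means of the weights they replace
    (the per-side optimum of the distance (3)); real TTQ trains them.
    """
    delta = t * max(abs(w) for w in weights)
    tern = ternarize(weights, delta)
    pos = [w for w, q in zip(weights, tern) if q == 1]
    neg = [w for w, q in zip(weights, tern) if q == -1]
    w_p = sum(pos) / len(pos) if pos else 0.0
    w_n = -sum(neg) / len(neg) if neg else 0.0
    return [w_p if q == 1 else (-w_n if q == -1 else 0.0) for q in tern]

layer = [0.9, -0.4, 0.02, -0.7, 0.5]
print(ttq_quantize(layer))
```

Each quantized layer then needs only a 2-bit code per weight plus two scalars, which is the source of the compression discussed later.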
Fig. 4. An example of how the TTQ algorithm works. Firstly, the normalized full-precision weights and biases are quantized to {-1, 0, +1} by the given per-layer threshold. Secondly, the positive and negative quantization factors are used to scale the quantized weights. Finally, the calculated gradients are back-propagated to each layer. The right part in the dotted rectangle represents the inference stage.

The workflow of TTQ is illustrated in Fig. 4: the normalized full-precision weights are quantized by Δ_l, w_l^p and w_l^n with full-precision activations. Instead of using the optimized threshold of TWN, TTQ adopts a heuristic method to calculate Δ_l:

$$\Delta_l = t \times \max(|\theta_l|), \quad (5)$$

where t is a constant factor determined by experience and Δ = {Δ_1, Δ_2, ..., Δ_l, ...}.

III. PROPOSED ALGORITHM
In this section, we first propose federated trained ternary quantization (FTTQ for short) to reduce the energy consumption of each client during inference as well as the upstream and downstream communications. Subsequently, a ternary federated averaging protocol (T-FedAvg for short) is suggested.
A. Federated Trained Ternary Quantization
Since no direct data exchange is usually allowed between clients in a federated learning system, weight divergence [14] may differ among clients. For example, in the l-th layer of the global model shared by client C1 and client C2, if max(|θ_l^{C1}|) = 5 and max(|θ_l^{C2}|) = 50, it is not necessarily true that the global model will be biased towards C2 if we use the same factor for the two models. To address this issue, we start by scaling the weights to [-1, 1]:

$$\theta^s = g(\theta), \quad (6)$$

where g is a scaling function, $\mathbb{R}^n \to [-1, 1]$. However, magnitude imbalance [38] may be introduced when scaling the entire θ of a network at once, resulting in a significant loss of precision, since most of the elements are pushed to zero. Therefore, we scale the weights layer by layer.

Then, using the same strategy as TTQ, we calculate the quantization threshold Δ according to the scaled weights as follows:

$$\Delta = T_k \times \max(|\theta^s|), \quad (7)$$

where T_k is a hyperparameter with a default setting on client k, and Δ = {Δ_1, Δ_2, ..., Δ_l, ...}. However, according to (6) and (7), we can easily find that the thresholds of all layers are mostly the same, since the maximum absolute value of the scaled θ^s is 1 in most layers. Thus, the model capability may be affected by the homogeneity of the threshold. To avoid this issue, we propose an alternative threshold calculation criterion:

$$\Delta = \frac{T_k}{m}\sum_{i=1}^{m} |\theta_i^s|, \quad (8)$$

where m is the number of neurons and Δ is calculated layer-wise. Obviously, the threshold obtained by (8) is influenced by the layer sparsity and can be seen as an extension of (7):

$$\Delta = \frac{T_k}{m}\sum_{i=1}^{m} |\theta_i^s| \le \frac{T_k}{m}\left(m \times \max|\theta^s|\right) \le \frac{T_k}{m}(m \times 1) \le T_k. \quad (9)$$

Notably, the threshold turns into the optimal solution proposed in [32] if we set the value of T_k to 0.7.
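The layer-wise scaling (6) and the mean-based threshold (8), combined with the mask, sign and scale steps (10)-(12) defined next, can be sketched as follows (plain Python; taking g to be division by the layer's maximum absolute weight, and the toy values, are assumptions for illustration):

```python
def scale_layer(theta):
    """g in (6): map a layer's weights into [-1, 1] via its max magnitude."""
    m = max(abs(w) for w in theta)
    return [w / m for w in theta] if m > 0 else list(theta)

def fttq_layer(theta, w_q, t_k=0.05):
    """One FTTQ layer: scale (6), mean threshold (8), mask/sign/scale (10)-(12)."""
    theta_s = scale_layer(theta)
    delta = t_k * sum(abs(w) for w in theta_s) / len(theta_s)        # rule (8)
    mask = [1 if abs(w) > delta else 0 for w in theta_s]             # (10)
    i_t = [m * (1 if w > 0 else -1) for w, m in zip(theta_s, mask)]  # (11)
    return [w_q * q for q in i_t]                                    # (12)

layer = [5.0, -2.0, 0.1, -4.0, 3.0]
print(fttq_layer(layer, w_q=0.6))
```

Here w_q is a single scalar per layer, matching the single trained quantization factor used by FTTQ; in the actual algorithm it is updated by backpropagation rather than fixed.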
The calculation method of Δ is generally adjusted according to the performance.

Subsequently, several operations are taken to achieve layer-wise weight quantization to overcome the computational burden and reduce the communication costs:

$$mask = \varepsilon(|\theta^s| - \Delta), \quad (10)$$

$$I^t = \mathrm{sign}(mask \odot \theta^s), \quad (11)$$

$$\theta^t = w^q \times I^t, \quad (12)$$

where ε is the step function, ⊙ is the Hadamard product, w^q = [w_1^q, w_2^q, ..., w_l^q, ...] is an independent quantization vector which is trained together with the other parameters layer by layer, and I^t is the quantized ternary weights. Consequently, the mask matrix can be rewritten as the union of a positive index I^p and a negative index I^n of the local model:

$$I^p = \{i \mid \theta_i > \Delta\}, \quad (13)$$

$$I^n = \{i \mid \theta_i < -\Delta\}. \quad (14)$$

Different from standard TTQ, we adopt one quantization factor per layer, which is updated with its gradients together with the other parameters, instead of the previous two quantization factors in each layer, mainly for the following reasons.
1) Stability:
Large weight divergence will be encountered after synchronization if the participating clients are initialized with different parameters, which leads to performance degeneration [14]. Hence, the weight divergence should be minimized at each layer in the federated learning environment.
2) Energy Consumption:
We present a proposition about the convergence trends of w^p and w^n and its proof in Section IV, followed by experimental results on the two factors with different initial values when training MLP and ResNet in Appendix A. It is worth noting that the trends of the positive and negative quantization factors of the TTQ algorithm are almost the same in all layers. Hence, the energy consumption of calculating the gradients of two quantization factors during back propagation can be cut in half if only one quantization factor is retained, which is important for resource-constrained clients.

After quantifying the whole network, the loss function can be calculated and the errors backpropagated in the same way as for continuous weights, except that the weights are ±w^q or zero. The gradients of w^q and the latent full-precision model are calculated according to the rules in [37]. The new update rule is summarized in Algorithm 1. Consequently, FTTQ significantly reduces the size of the updates transmitted to the server, thus reducing the costs of upstream communications. However, the costs of the downstream communications will not be reduced if no additional measures are taken, since the weights of the global model cannot be decomposed into the coefficient and ternary matrix after aggregation. To address this issue, a ternary federated learning protocol is presented in the next section.

Algorithm 1:
Federated Trained Ternary Quantization (FTTQ)

Input: full-precision parameters θ and quantization vector w^q, loss function l, dataset D with sample pairs (x_i, y_i), i = {1, 2, ..., |D|}, learning rate η.
Output: quantized model θ^t.
Init: all client parameters are initialized with θ.
for (x_i, y_i) ∈ D do
    θ^s ← g(θ)
    mask ← ε(|θ^s| − Δ)
    I^t ← sign(mask ⊙ θ^s)
    θ^t ← w^q × I^t
    J ← l(x_i, y_i; θ^t)
    ∂J/∂w^q ← Σ_{i∈I^p} ∂J/∂θ_i^t
    w^q ← w^q + η ∂J/∂w^q
    θ^t ← θ^t + η ∂J/∂θ^t
    update θ
end
Return θ^t (including w^q, I^t)

B. Ternary Federated Averaging
The two-step scheme of the proposed ternary federated averaging protocol with private data is elaborated in Fig. 5. In general, the participating clients quantize the normalized local models and upload the thresholds, quantization factors and ternary models to the server. The server then aggregates all local models to obtain the global model. Finally, the server quantifies the normalized global model again using fixed thresholds and pushes the quantized global model to all clients. The basic flow is described as follows.
1) Upstream:
Let K = {1, 2, ..., |λN|} be the set of indices of the randomly selected clients that participate in the aggregation, where λ is the participation ratio and N is the total number of clients in the federated learning system. The local scaled full-precision and quantized weights of client k ∈ K are denoted by θ_k^s and θ_k^t, respectively. We upload the trained θ_k^t (w^q and I^t) to the server instead of the updates ∇θ_k after the local iterations, although the two are equivalent [11]. At the inference stage, only the quantized model is needed for prediction.
2) Downstream:
After r communication rounds, the server rebuilds all models received from the participating clients, and the global model can be calculated by

$$\theta_{r+1} \leftarrow \frac{\sum_{k=1}^{\lambda N} |D_k|\, \theta_k^t}{\sum_{k=1}^{\lambda N} |D_k|}.$$

Then the server quantifies the global model again with a constant threshold Δ (default setting 0.05) and pushes the quantized model to the clients.

Algorithm 2:
Ternary Federated Averaging

Input: initial global model parameters θ.
Init: broadcast θ to clients k, k = {1, 2, 3, ..., N}, and assign each client a unique dataset D_k.
for round r = 1 to T do
    for c ∈ K = {1, 2, ..., |λN|} in parallel do
        Client k does:
            download quantized θ_{r−1}^t
            initialize w^q
            θ_{k,r}^t ← FTTQ(θ_{r−1}^t, w^q)
            upload θ_{k,r}^t to the server
    end
    Server does:
        θ_r ← Σ_{k=1}^{λN} |D_k| θ_{c,r}^t / Σ_{k=1}^{λN} |D_k|
        mask ← ε(|θ_r| − Δ)
        θ_r^t ← sign(mask ⊙ θ_r)
        broadcast θ_r^t to all clients
end

Unlike standard federated learning algorithms, our method compresses communications during both the upload and download phases, which brings major advantages when deploying DNNs at the inference stage on resource-constrained devices. Specifically, the clients move the local networks from 32-bit to 2-bit and push the 2-bit networks and quantization parameters to the server, and then download the quantized global model from the server at the end. For example, if we configure a federated learning environment involving 20 clients and a global model that requires 25 MB of storage space, the total communication costs of standard federated learning are about 1 GB per round (upload and download). By contrast, our method reduces the costs to 65 MB per round (upload and download), which is
about 1/16 of the cost of the standard method. Note that quantifying the global model pushed to the clients makes reverse engineering more difficult. The overall workflow of the proposed ternary federated protocol is summarized in Algorithm 2.

Fig. 5. The diagram of the proposed T-FedAvg. The blue part runs on the clients with normalized full-precision weights, which is similar to the standard federated learning framework; then the quantization factors, thresholds and ternary local models are pushed to the server, as shown in the orange part; after that, the global model is obtained by server aggregation and normalization; finally, the global model is quantized and pushed back to all clients.

IV. THEORETICAL ANALYSIS
In this section, we first formally demonstrate the properties of the two quantization factors w^p, w^n in TTQ, followed by a proof of the unbiasedness of FTTQ and T-FedAvg. By default in this paper, subscripts represent the indices of the elements in a network instead of the indices of the clients in the federated learning system.

A. The Convergence of the Quantization Factors in TTQ
The experimental results on the convergence profiles of w^p and w^n for two widely used neural networks are presented in Appendix A, showing that the two factors converge to the same value. To prove the convergence theoretically, we first introduce the following assumption.

Assumption 4.1:
The elements of the scaled full-precision θ are uniformly distributed between -1 and 1:

$$\forall \theta_i \in \theta, \quad \theta_i \sim U(-1, 1). \quad (15)$$

Then we have the following proposition.

Proposition 4.1:
Given a one-layer online gradient system whose parameters are elementwise initialized with a symmetric distribution centered at 0, e.g., θ_i ∼ U(−1, 1), and quantized by TTQ with two iteratively updated factors w^p, w^n and a fixed threshold Δ, we have:

$$\lim_{e \to +\infty} w^p = \lim_{e \to +\infty} w^n, \quad (16)$$

where e is the training epoch and w^p, w^n, Δ > 0.

Proof 4.1:
The converged w^{p*} and w^{n*} can be regarded as the optimal solutions for the quantization factors, which minimize the Euclidean distance between the full-precision weights θ and the quantized weights θ^t, equal to w^p I^p − w^n I^n. Then we have:

$$w^{p*}, w^{n*} = \arg\min_{w^p, w^n} \|\theta - w^p I^p + w^n I^n\|_2^2, \quad (17)$$

where I^p = {i | θ_i ≥ Δ}, I^n = {j | θ_j ≤ −Δ} and I^z = {k | |θ_k| < Δ}, and according to (4) we have

$$\theta - w^p I^p + w^n I^n = \begin{cases} \theta_i - w^p, & i \in I^p \\ \theta_k, & k \in I^z \\ \theta_j + w^n, & j \in I^n. \end{cases} \quad (18)$$

Then the original problem can be transformed to

$$\begin{aligned} \|\theta - w^p I^p + w^n I^n\|_2^2 &= \sum_{i \in I^p} (\theta_i - w^p)^2 + \sum_{j \in I^n} (\theta_j + w^n)^2 + \sum_{k \in I^z} \theta_k^2 \\ &= |I^p| (w^p)^2 + |I^n| (w^n)^2 - 2 w^p \sum_{i \in I^p} \theta_i + 2 w^n \sum_{j \in I^n} \theta_j + C, \end{aligned} \quad (19)$$

where $C = \sum_{i \in I^p} \theta_i^2 + \sum_{j \in I^n} \theta_j^2 + \sum_{k \in I^z} \theta_k^2$ is a constant independent of w^p and w^n. Hence the optimal solution of (19) is obtained when

$$w^{p*} = \frac{1}{|I^p|} \sum_{i \in I^p} \theta_i, \qquad w^{n*} = -\frac{1}{|I^n|} \sum_{j \in I^n} \theta_j. \quad (20)$$

Since the weights are distributed symmetrically, w^{p*} and w^{n*} will converge to the same value. This completes the proof.

B. The Unbiasedness of FTTQ
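Before the formal argument, both the closed-form solution (20) of Proposition 4.1 and the unbiasedness established in this subsection can be checked numerically; a minimal Monte-Carlo sketch is given below (pure Python, weights uniform as in Assumption 4.1; the sample size and threshold are arbitrary choices):

```python
import random

random.seed(0)
DELTA = 0.3
theta = [random.uniform(-1.0, 1.0) for _ in range(200_000)]

# Closed-form optima (20): per-side means of the retained weights.
pos = [t for t in theta if t >= DELTA]
neg = [t for t in theta if t <= -DELTA]
w_p = sum(pos) / len(pos)
w_n = -sum(neg) / len(neg)

# Symmetric weights => the two factors coincide (Proposition 4.1).
print(abs(w_p - w_n))          # close to 0

# FTTQ with a single factor w_q = w_p: its mean matches E(theta) = 0.
quantized = [w_p * (1 if t >= DELTA else (-1 if t <= -DELTA else 0))
             for t in theta]
print(sum(quantized) / len(quantized))   # close to 0
```

For U(−0.3, disregarding sign) both side means estimate (1 + Δ)/2 = 0.65, matching (23), and the mean of the quantized weights estimates E(θ) = 0, matching (28).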
Here, we first prove the unbiasedness of FTTQ. To simplify the original problem, we adopt Assumption 4.1, which is common in network initialization, and prove the following proposition.
Proposition 4.2:
Let θ be the local scaled network parameters defined in Assumption 4.1 for one client in a given federated learning system. If θ is quantized by the FTTQ algorithm, then we have

$$\mathbb{E}[FTTQ(\theta)] = \mathbb{E}(\theta). \quad (21)$$

Proof 4.2:
According to (20), w^{q*} is calculated from the elements in I^p = {k | θ_k ≥ Δ}, where Δ is a fixed number once the parameters are generated under Assumption 4.1; hence the elements indexed by I^p obey a new uniform distribution between Δ and 1:

$$\forall k \in I^p, \quad \theta_k \sim U(\Delta, 1), \quad (22)$$

and therefore the probability density function f of θ_k (k ∈ I^p) is f(x) = 1/(1 − Δ). According to Proposition 4.1 and (20), we have:

$$\mathbb{E}(w^{q*}) = \mathbb{E}\Big[\frac{1}{|I^p|}\sum_{k \in I^p} \theta_k\Big] = \frac{1}{|I^p|}\, |I^p| \int_\Delta^1 \theta_i f(\theta_i)\, d\theta_i = \int_\Delta^1 \theta_i f(\theta_i)\, d\theta_i = \frac{1 + \Delta}{2}, \quad (23)$$

where θ_i is an arbitrary element of I^p and |I^p| represents the number of elements in I^p. We know that

$$\mathbb{E}[FTTQ(\theta)] = \mathbb{E}[w^{q*} \times \mathrm{sign}(mask(\theta) \times \theta)] = \mathbb{E}(w^{q*})\, \mathbb{E}[\mathrm{sign}(\theta)\, mask(\theta)], \quad (24)$$

and since

$$\mathbb{E}[\mathrm{sign}(\theta)\, mask(\theta)] = \frac{1 - \Delta}{2} \times 1 + \Delta \times 0 + \frac{1 - \Delta}{2} \times (-1) = 0, \quad (25)$$

hence

$$\mathbb{E}[FTTQ(\theta)] = \frac{1 + \Delta}{2} \times 0 = 0, \quad (26)$$

and under Assumption 4.1, we have

$$\mathbb{E}(\theta) = \int_{-1}^{1} \frac{\theta}{2}\, d\theta = 0, \quad (27)$$

from which it is immediate that

$$\mathbb{E}[FTTQ(\theta)] = \mathbb{E}(\theta). \quad (28)$$

Hence, the FTTQ quantizer output can be considered an unbiased estimator of the input [39]. We can guarantee the unbiasedness of FTTQ in federated learning systems when the weights are uniformly distributed. Furthermore, since the distribution of the network weights in most layers may be non-uniform due to stochastic errors from the data, the self-learned factor w^q and Δ may reduce the quantization errors, which can be regarded as a non-uniform sampling method.

C. The Properties of T-FedAvg
Here, we adopt the following assumption, widely used in the literature [14], to demonstrate the properties of T-FedAvg.
Assumption 4.2:
When a federated learning system with K clients and one server is established, all clients are initialized with the same global model.

Under this assumption, Zhao et al. [14] proved the following conclusions:

Lemma 4.1:
The weight divergence that leads to performance degeneration after r rounds of synchronization between the clients and the server mainly comes from two parts: the weight divergence after r − 1 rounds of aggregation, i.e., ||θ_{r−1}^f − θ_{r−1}^c|| (the superscript c denotes the centralized setting), and the weight divergence resulting from the Earth mover's distance (EMD) between the data distribution on each client and the actual distribution of the whole data population.

Lemma 4.2:
When all clients are initialized with the same global model, the EMD between the data distribution on each client and the distribution of the whole data population becomes the main cause of the performance degeneration.

Since our method is unbiased and reduces the Euclidean distance between the quantized network and the full-precision network, we can conclude that T-FedAvg can also perform well if the original FedAvg converges to the optimal solution.

V. EXPERIMENTAL RESULTS
This section evaluates the performance of the proposedmethod on widely used benchmark datasets. We set upmultiple controlled experiments to examine the performancecompared with the standard federated learning algorithm interms of the test accuracy and communication costs. In thefollowing, we present the experimental settings and the results.
A. Settings
To evaluate the performance of the proposed network quantization and ternary protocol in federated learning systems, we first conduct experiments with 10 independent physical clients connected by a Local Area Network (LAN). Then, we test the obtained model in a simulated environment with the number of clients varying from 10 to 100.

The physical system consists of four CPU laptops connected wirelessly through the LAN to mimic low-power mobile devices; one laptop acts as the server aggregating the models, and the remaining laptops act as clients participating in the federated training. Each client communicates only with the server, and there is no information exchange between the clients.

For the simulations, we typically use 10 clients, in accordance with the number of classes in the datasets. A detailed description of the configuration is given below.

1) Compared algorithms. In this work, we compare the following algorithms:
• Baseline: the centralized learning algorithm, e.g., the stochastic gradient descent (SGD) method, which means that all data is stored in a single computing center and the model is trained directly on the entire data.
• FedAvg: the canonical federated learning approach presented in [11].
• TTQ: the canonical trained ternary quantization method, in which the configuration is the same as the baseline, i.e., the data is stored in a centralized manner and a single model is trained using all the data.
• T-FedAvg: our proposed quantized federated learning approach.

2) Datasets. We select two representative benchmark datasets that are widely used for classification; no data augmentation is used in any experiment.
• MNIST [34]: it contains 60000 training and 10000 testing gray-scale handwritten digit images with 10 classes, where the dimension of each image is 28 × 28.
• CIFAR10 [35]: it contains 60000 colored images of 10 types of objects from frogs to planes, 50000 for training and 10000 for testing. It is a widely used benchmark dataset from which it is difficult to extract features.

3) Models. To evaluate the performance of the above algorithms, two popular deep learning models are selected: MLP and ResNet∗, which represent tiny and large models, respectively. The detailed settings are as follows:
• MLP: it is mainly used for training on small datasets, e.g., MNIST. The model contains two hidden layers with 30 and 20 neurons, respectively. For centralized and distributed training, the learning rate η is set to the same value, and the ReLU function is selected as the activation function.
• ResNet18∗: it is a simplified version of the widely used ResNet [40], where the number of input and output channels of all convolutional layers is reduced to 64. It is a typical benchmark model for evaluating the performance of algorithms on large datasets.

4) Data distribution. The performance of federated learning is affected by the features of the training data stored on the separated clients. To investigate the impact of different data distributions, several types of data are generated:
• IID data: each client holds an IID subset of the data containing all 10 classes.
• Non-IID data: the union of the samples on all clients is the entire dataset, but the number of classes contained on each client is not equal to the total number of categories in the entire dataset (10 for MNIST and CIFAR10). We use the labels to assign samples of N_c classes to each client, where N_c is the number of classes per client. In the extremely non-IID case, N_c equals 1 for each client, but this case is generally not considered since there is no need to train (e.g., a classifier) if only one class is stored on each client.
• Unbalancedness in data size: typically, the sizes of the datasets on different clients vary a lot. To investigate the influence of unbalanced data sizes in the federated learning environment, we split the entire dataset into several distinct parts.

5) Basic configuration. The basic configuration of the federated learning system in our experiments is set as follows:
• Total number of clients: N = 100.
• Participation ratio per round: λ = 0.1.
• Classes per client: N_c = 10.
• Local batch size: B = 64.
• Local epochs: E = 5.

All experimental settings are summarized in Table I.

TABLE I
MODELS AND HYPERPARAMETERS

Models             MLP       ResNet∗
Dataset            MNIST     CIFAR10
Optimizer          SGD       Adam
Learning rate      0.0001    0.008
Baseline accuracy  92.75%    86.30%
Parameter amount   24330     607050
The learning rate of the centralized and federated learning algorithms is the same and remains constant during training. Note that a small learning rate is set for training MLP to slow down the convergence speed for easy observation.
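The label-based non-IID split described in the settings above can be sketched with a shard-based recipe. The helper below is an assumption for illustration (the paper does not specify its exact splitter): it sorts sample indices by label, cuts them into N × N_c shards, and hands each client N_c shards, so every client sees at most N_c classes.

```python
import numpy as np

def partition_non_iid(labels, num_clients, classes_per_client, rng):
    """Shard-based non-IID split: each client receives `classes_per_client`
    label-sorted shards, so it holds samples from at most that many classes.
    (Illustrative recipe, not the authors' exact splitter.)"""
    order = np.argsort(labels, kind="stable")           # indices grouped by label
    shards = np.array_split(order, num_clients * classes_per_client)
    shard_ids = rng.permutation(len(shards))            # shuffle shard assignment
    return [
        np.concatenate([shards[s] for s in
                        shard_ids[c * classes_per_client:(c + 1) * classes_per_client]])
        for c in range(num_clients)
    ]

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 6000)   # MNIST-like: 10 classes x 6000 samples
parts = partition_non_iid(labels, num_clients=10, classes_per_client=2, rng=rng)
print(len(parts), max(len(np.unique(labels[p])) for p in parts))  # 10 clients, <= 2 classes each
```

Setting `classes_per_client` to the total number of classes recovers an (approximately) IID split, which is how the N_c = 10 configuration above can be read.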
B. Performance on IID Data
In this part, we conduct experiments on IID MNIST and CIFAR10 using the benchmark algorithms with the MLP and ResNet∗ models described above, where the baseline and TTQ are representatives of the centralized approaches.

Specifically, the data used by the centralized methods is stored in one computing center, while the data used by FedAvg and T-FedAvg is stored separately. To explore the best performance of FedAvg and T-FedAvg, the federated learning environment is set with 10 fully participating clients, each holding an IID subset of the data containing all 10 classes.

The results and model weight widths are summarized in Table II. We can see that the test accuracies achieved by the baseline algorithm and TTQ are 92.75% and 92.87%, respectively, when trained on MNIST with MLP, and 86.30% and 85.73%, respectively, on CIFAR10 with ResNet∗. TTQ suffers a slight performance deterioration on CIFAR10 when the model complexity is reduced.

TABLE II
TEST ACCURACIES AND WEIGHT WIDTHS OF DIFFERENT ALGORITHMS WHEN TRAINED ON IID DATA

Methods     MNIST               CIFAR10
            Accuracy  Width     Accuracy  Width
Baseline    92.75%    32 bit    86.30%    32 bit
FedAvg      92.37%    32 bit    85.72%    32 bit
TTQ         92.87%    2 bit     85.73%    2 bit
T-FedAvg    92.75%    2 bit     86.60%    2 bit
When the data distribution is IID, FedAvg achieves accuracies of 92.37% and 85.72% on MNIST and CIFAR10, respectively. However, T-FedAvg, whose model is about 1/16 of the full-precision model size, achieves 92.75% on MNIST and the highest accuracy on CIFAR10, 86.60%. It is worth noting that, as the network deepens, the quantization error declines and the quantized model may even exceed the accuracy of the original model, as evidenced by the performance of T-FedAvg with ResNet∗.

The convergence speeds of the four methods over the local iterations are illustrated in Fig. 6, where the centralized methods, the baseline and TTQ, obtain convergence curves aligned with the federated learning environment by sampling at intervals of λ ∗ N ∗ E. Overall, the convergence speed of our method is the fastest when trained on MNIST, and in the initial phase on CIFAR10 it is slightly slower than FedAvg, which is in line with the performance of TTQ.

[Fig. 6: Test accuracy versus communication rounds of the baseline, TTQ, FedAvg, and T-FedAvg, for MLP on MNIST (left) and ResNet∗ on CIFAR10 (right).]
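The roughly 1/16 model size mentioned above follows directly from the bit widths in Table II (2-bit versus 32-bit weights). A back-of-the-envelope sketch using the parameter counts from Table I, ignoring the per-layer full-precision quantization factors (so 16× is an upper bound on the saving):

```python
# Rough per-round upstream traffic for the models of Table I.
# Full precision sends 32 bits per weight; the ternary protocol sends
# 2 bits per weight (scaling-factor overhead ignored in this estimate).
def upstream_bytes(num_params, bits_per_weight):
    return num_params * bits_per_weight / 8

for name, n in [("MLP", 24330), ("ResNet*", 607050)]:
    full = upstream_bytes(n, 32)
    tern = upstream_bytes(n, 2)
    print(f"{name}: {full / 1e6:.3f} MB -> {tern / 1e6:.3f} MB ({full / tern:.0f}x smaller)")
```

For ResNet∗ this corresponds to roughly 2.4 MB versus 0.15 MB per upload, which is the source of the communication savings reported for T-FedAvg.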