Ternary Compression for Communication-Efficient Federated Learning
Jinjin Xu, Wenli Du, Ran Cheng, Wangli He, Senior Member, IEEE, and Yaochu Jin, Fellow, IEEE
Abstract—Learning over massive data stored in different locations is essential in many real-world applications. However, sharing data is full of challenges due to the increasing demands of privacy and security with the growing use of smart mobile devices and IoT devices. Federated learning provides a potential solution to privacy-preserving and secure machine learning, by means of jointly training a global model without uploading data distributed on multiple devices to a central server. However, most existing work on federated learning adopts machine learning models with full-precision weights, and almost all these models contain a large number of redundant parameters that do not need to be transmitted to the server, consuming excessive communication costs. To address this issue, we propose a federated trained ternary quantization (FTTQ) algorithm, which optimizes the quantized networks on the clients through a self-learning quantization factor. A convergence proof of the quantization factor and the unbiasedness of FTTQ is given. In addition, we propose a ternary federated averaging protocol (T-FedAvg) to reduce the upstream and downstream communication of federated learning systems. Empirical experiments are conducted to train widely used deep learning models on publicly available datasets, and our results demonstrate the effectiveness of FTTQ and T-FedAvg compared with the canonical federated learning algorithms in reducing communication costs and maintaining the learning performance.
Index Terms—Deep learning, federated learning, communication efficiency, ternary coding.
I. INTRODUCTION

The number of Internet of Things (IoT) and smart mobile devices deployed, e.g., in the process industry has grown dramatically over the past decades, generating massive amounts of data stored distributively every moment. Meanwhile, recent achievements in deep learning [1], such
as AlphaGo [2], rely heavily on the knowledge stored in big data. Naturally, adopting deep learning methods for effective utilization of the rich data contained in local clients of the process industry, e.g., branch factories, will provide strong support to industrial production.

Manuscript received xx, 2020; revised xxx, 2020. This work was supported by the National Natural Science Foundation of China (Basic Science Center Program: 61988101), the International (Regional) Cooperation and Exchange Project (1720106008), the National Natural Science Foundation of China (Major Program: 61590923), the National Natural Science Fund for Distinguished Young Scholars (61725301), the National Natural Science Foundation of China (Major Program: 61590922) and the China Scholarship Council (201906745025). (Corresponding authors: Wenli Du; Yaochu Jin.)
Jinjin Xu, Wenli Du and Wangli He are with the Key Laboratory of Advanced Control and Optimization for Chemical Processes, Ministry of Education, East China University of Science and Technology, Shanghai, 200237, China, and also with the Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai, 200092, China. E-mail: [email protected], [email protected], [email protected].
Ran Cheng is with the Shenzhen Key Laboratory of Computational Intelligence, University Key Laboratory of Evolving Intelligent Systems of Guangdong Province, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China. E-mail: [email protected].
Yaochu Jin is with the Department of Computer Science, University of Surrey, Guildford, GU2 7XH, UK. E-mail: [email protected].
However, training deep learning models with distributed data is difficult, while uploading private data to the cloud is controversial, due to limitations on network bandwidth, budgets and security regulations, e.g., GDPR [3].

Many research efforts have been devoted to related fields recently, while early work in this area mainly focused on training deep models on multiple machines to alleviate the computational burden of large data volumes, known as distributed machine learning [4]. These methods have achieved satisfactory performance by splitting big data into small sets to accelerate the model training process. For example, data parallelism [4], model parallelism [4], [5] (see Fig. 1) and parameter servers [6]–[8] are commonly used methods in practice. Correspondingly, weight optimization strategies for multiple machines have also been proposed. For example, Zhang et al. [9] proposed an asynchronous mini-batch stochastic gradient descent algorithm (ASGD) on multi-GPU devices for training deep neural networks (DNNs) and achieved a 3.2 times speed-up on 4 GPUs over a single GPU without loss of precision. Recently, distributed machine learning algorithms for multiple datacenters located in different regions have been studied in [10]. However, little attention has been paid to data security and the impact of data distribution on performance.

To address the drawbacks of distributed learning, researchers have proposed an interesting framework to train a global model while keeping the private data local, known as federated learning [11]–[13]. The federated approach makes it possible to extract knowledge from the data distributed on local devices without uploading private data to a certain server. Fig. 2 illustrates the simplified workflow and an application scenario in the process industry. Several extensions have been introduced to the standard federated learning system. Zhao et al.
[14] have observed weight divergence caused by extreme data distributions and proposed sharing a small amount of data with other clients to enhance the performance of federated learning algorithms. Wang et al. [15] have proposed adaptive federated learning systems under a given resource budget based on a control algorithm that balances client updates and global aggregation, and analyzed the convergence bound of the distributed gradient descent. Recent comprehensive overviews of federated learning can be found in [16], [17], and design ideas, challenges and future research directions of federated learning on massive mobile devices are presented in [18], [19].

Since users may pay more attention to privacy protection and data security, federated learning will play a key role in deep learning, although it faces challenges in terms of data distribution and communication costs.
1) Data Distribution:
The data generated by different clients, e.g., factories, may be unbalanced and not subject to the independent and identically distributed (IID) hypothesis, i.e., the datasets are unbalanced and/or non-IID.
2) Communication Costs:
Federated learning is influenced by the rapidly growing depth of models and the number of multiply-accumulate operations (MACs) [20]. This is due to the fact that massive communication costs for uploading and downloading are necessary, while the average upload and download speeds are asymmetric, e.g., a mean mobile download speed of 26.36 Mbps vs. an upload speed of 11.05 Mbps in the UK in Q3-Q4 2017 [21].
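To make this asymmetry concrete, the following sketch estimates per-round transfer times for a full-precision model at the quoted UK speeds; the 25 MB model size is a hypothetical figure chosen for illustration only:

```python
# Estimate upload/download time per communication round for one client.
# Speeds are the UK Q3-Q4 2017 averages quoted above; the 25 MB model
# size is an assumed value for illustration.

MODEL_MB = 25.0
UPLOAD_MBPS = 11.05      # megabits per second
DOWNLOAD_MBPS = 26.36

def transfer_seconds(size_mb: float, speed_mbps: float) -> float:
    """Time to move size_mb megabytes at speed_mbps megabits/second."""
    return size_mb * 8.0 / speed_mbps

up = transfer_seconds(MODEL_MB, UPLOAD_MBPS)
down = transfer_seconds(MODEL_MB, DOWNLOAD_MBPS)
print(f"upload: {up:.1f} s, download: {down:.1f} s")  # upload takes ~2.4x longer
```

Under these assumptions the upstream direction dominates each round, which motivates compressing the uploaded updates first.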
Fig. 1. The diagram of model parallelism. The complete model is distributed and stored in multiple clients, and the model is stitched after the training procedure is finished on the clients.
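As a minimal illustration of the model parallelism in Fig. 1, the sketch below splits the layers of a toy network across hypothetical clients, runs a forward pass by handing activations from one partition to the next, and "stitches" the partitions back into one model (pure Python, no real networking; the layer definitions are illustrative placeholders):

```python
# Toy model parallelism: each "client" holds a contiguous slice of layers.
# A forward pass passes activations from one client to the next; after
# training, the slices are concatenated ("stitched") into the full model.

def make_layer(scale, bias):
    """A trivial 'layer': elementwise affine transform."""
    return lambda xs: [scale * x + bias for x in xs]

full_model = [make_layer(2.0, 0.0), make_layer(1.0, 1.0),
              make_layer(0.5, 0.0), make_layer(1.0, -1.0)]

# Partition the four layers over two clients.
clients = [full_model[0:2], full_model[2:4]]

def forward(partitions, xs):
    for part in partitions:        # client boundary: activations are handed on
        for layer in part:
            xs = layer(xs)
    return xs

# "Stitching": concatenating the partitions recovers the complete model.
stitched = [layer for part in clients for layer in part]
print(forward(clients, [1.0, 2.0]))
```

The partitioned and the stitched model compute identical outputs, since partitioning only changes where each layer is stored, not the composition of functions.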
Obviously, high communication costs are one of the main obstacles to distributed and federated training. Although the initial model compression research was not intended to reduce communication costs, it has been a source of inspiration for communication-efficient distributed learning. Neural network pruning is an early model compression method proposed in [22]. Parameter pruning and sharing [22], [23], low-rank factorization [24], transferred/compact convolutional filters [25] and knowledge distillation [26] are some of the main ideas reported in the literature. Reduction of communication costs by simultaneously maximizing the performance and minimizing the model complexity using a multi-objective evolutionary algorithm is reported in [27]. Recently, a layer-wise asynchronous model update approach has been proposed in [28] to reduce the number of parameters to be transmitted.

Gradient quantization has been proposed to accelerate data-parallel distributed learning [29]; gradient sparsification [30] and gradient quantization [31], [32] have been developed to reduce the model size; an efficient federated learning scheme based on sparse ternary compression (STC) has been proposed [33], which is robust to non-IID data and communication-efficient on both upstream and downstream communications. However, since STC compresses the model after local training is completed, the quantization process is not optimized during training.

To the best of our knowledge, most federated learning methods employ full-precision networks or streamline the models after the training procedure on the client is finished, rather than simplifying the model during training. Therefore, deploying federated learning on the IoT devices widely used in the process industry remains difficult.
To address this issue, we focus on model compression on the clients during training to reduce the energy consumption at the inference stage and the communication costs of federated learning. The main contributions of this paper are summarized as follows:
Fig. 2. An illustration of the application of federated learning in the process industry. The branch factories train local models on private client data through iterations, and send the trained local models to the server for aggregation to obtain the optimized global model.

• A ternary quantization approach is introduced into the training and inference stages of clients. The trained ternary models are well suited for inference on network edge devices, e.g., wireless sensors.
• A ternary federated learning protocol is presented to reduce the communication costs between clients and server, which compresses both upstream and downstream communications. Note that the quantification of the model weights can further enhance privacy protection, since it makes reverse engineering of the model more difficult.
• A theoretical analysis of the proposed algorithm is provided, and the performance of the algorithm is empirically verified on a deep feedforward network (MLP) and a deep residual network (ResNet) using widely used datasets, namely MNIST [34] and CIFAR10 [35].

The remainder of this paper is organized as follows. In Section II, we briefly review the standard federated learning protocol and several widely used network quantization approaches. Section III proposes a method to quantify the models of the clients in federated learning systems, called federated trained ternary quantization (FTTQ), based on the quantization algorithms mentioned earlier. On the basis of FTTQ, a ternary federated learning protocol that reduces both upstream and downstream communications is presented. In Section IV, a
theoretical analysis of the proposed algorithm is provided. Experimental settings and results are presented in Section V to compare the new protocol with standard algorithms. Finally, conclusions and future directions are given in Section VI.

II. BACKGROUND AND METHODS
In this section, we first introduce some preliminaries of the standard federated learning workflow and its basic formulations. Subsequently, the definitions and main features of popular ternary quantization methods are presented, followed by a numeric example.
A. Federated Learning Protocol
It is usually assumed that the data used by a distributed learning algorithm belongs to the same feature space, which may not be true in federated learning. As illustrated in Fig. 3, the basic protocol of federated learning proposed in [18] is round-based. Specifically, private storage, the server and the clients (usually mobile devices) are the main participants in the whole protocol, and there are three main phases in each round: selection, configuration and reporting.
Fig. 3. Federated learning workflow with massive mobile devices. Firstly, the server selects suitable clients and deploys the configuration (global model structure) for training. Then the clients complete the learning process within the specified budget (e.g., time) and return the local models to the server. Finally, the server aggregates all local models to obtain the trained global model.
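The round-based workflow above, together with the dataset-size-weighted aggregation formalized in (2) below, can be sketched as follows (pure Python; the client sampling, noise-based "training" and dataset sizes are illustrative placeholders, not the actual training procedure):

```python
import random

random.seed(42)

def local_update(global_weights, client_data):
    """Placeholder for local training: returns (updated weights, |D_k|)."""
    # A real client would run several epochs of SGD on its private data here.
    return ([w + 0.1 * random.uniform(-1, 1) for w in global_weights],
            len(client_data))

def federated_round(global_weights, clients, participation=0.5):
    """One selection/configuration/reporting round with weighted averaging."""
    selected = random.sample(clients, max(1, int(participation * len(clients))))
    results = [local_update(global_weights, data) for data in selected]
    total = sum(n for _, n in results)
    # Weighted average: each client contributes in proportion to |D_k|.
    return [sum(w[i] * n for w, n in results) / total
            for i in range(len(global_weights))]

clients = [[0] * random.randint(50, 200) for _ in range(10)]  # fake datasets
weights = [0.0, 0.0, 0.0]
for _ in range(3):
    weights = federated_round(weights, clients)
```

Only model parameters cross the network in this loop; the fake datasets never leave their clients, which is the defining property of the protocol.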
In this work, we assume supervised learning is used to train the models. The global model with parameters θ, deployed on a distributed client k, is trained on the local dataset D_k, which consists of training sample pairs (x_i, y_i), i = 1, 2, ..., |D_k|. The loss of sample pair (x_i, y_i) is denoted by l(x_i, y_i; θ), where l is the loss function. The total loss of a certain task on client k is J_k:

$$J_k(\theta) = \frac{1}{|D_k|}\sum_{i=1}^{|D_k|} l(x_i, y_i; \theta). \quad (1)$$

We assume there are N clients whose data is stored independently, and the aim of federated learning is to minimize the global loss L. Therefore, the global objective function of the federated learning system can be defined as:

$$L(\theta) = \sum_{k=1}^{\lambda N} \frac{|D_k|}{\sum_{k=1}^{\lambda N} |D_k|}\, J_k(\theta), \quad (2)$$

where λ is the proportion of the clients participating in the aggregation in the current round, which is determined by the above three phases. Theoretically, the participation ratio λ in (2) is calculated as the number of participating clients divided by the total number of clients. Additionally, the local batch size B and the number of local epochs E are also important hyperparameters. In the experiments, we further study the effects of these parameters on the performance of the algorithm by manually setting their values.

It is easy to see that the communication costs depend heavily on the amount of information to be transferred between the server and the clients, and the dominating factor in this procedure is the size of the parameters. One of the requirements for communication-efficient federated learning is that both upstream and downstream communications need to be compressed [33]. Note that the performance of federated learning may drop dramatically due to disproportionation of the data distribution.

B. Quantization
Quantization improves the energy and space efficiency of deep networks by reducing the number of bits per weight [36]. This is done by mapping the parameters in a continuous space to a discrete quantization space, which can greatly reduce model redundancy and save memory overhead. It has been shown that the ternary weight network (TWN for short) [32] is able to reduce the Euclidean distance between the quantized parameters θ^t (consisting of -1, +1 and 0), scaled by a factor α, and the full-precision parameters θ compared with binary networks [31], thus making the accuracy of the quantized network close to that of the full-precision network:

$$\alpha^*, \theta^{t*} = \arg\min_{\alpha,\theta^t} F(\alpha, \theta^t) = \|\theta - \alpha\theta^t\|_2^2, \quad (3)$$

where F represents the cost function of this optimization problem; α* and θ^{t*} are the optimal solutions of F, with θ ≈ α*θ^{t*}. Li et al. [32] introduce an approximate optimal solution with a threshold-based function to quantify all layers of the deep neural network model:

$$\theta_l^t = \begin{cases} +1, & \theta_l > \Delta_l, \\ 0, & |\theta_l| \le \Delta_l, \\ -1, & \theta_l < -\Delta_l, \end{cases} \quad (4)$$

where θ_l and θ_l^t are the full-precision and quantized weights of the l-th layer, with θ = {θ_1, θ_2, ..., θ_l, ...} and θ^t = {θ_1^t, θ_2^t, ..., θ_l^t, ...}, respectively, which provides a rule of thumb to calculate the optimal Δ* = {Δ_1^*, Δ_2^*, ..., Δ_l^*, ...}.

However, the weights of TWN are limited to -1, 0, 1 and α* is a constant. In order to further improve the performance of quantized deep networks while maintaining the compression ratio, Zhu et al. [37] have proposed a trained ternary quantization algorithm (TTQ for short). In TTQ, two quantization factors (a positive factor w^p and a negative factor w^n) are adopted to scale the ternary weights in each layer.
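A minimal sketch of threshold-based ternary quantization in the style of (4), with per-layer positive/negative scaling factors in the spirit of TTQ, is given below (plain Python; the threshold rule, toy weights and the closed-form choice of the factors are assumptions for illustration — in real TTQ the two factors are learned by backpropagation):

```python
def ternarize(weights, delta):
    """Threshold function (4): map each weight to {-1, 0, +1}."""
    return [1 if w > delta else (-1 if w < -delta else 0) for w in weights]

def ttq_quantize(weights, t=0.05):
    """TTQ-style layer quantization with heuristic threshold t * max|w|.

    Here w_p and w_n are set to the means of the weights they replace
    (the per-side optimum of the distance (3)); real TTQ trains them.
    """
    delta = t * max(abs(w) for w in weights)
    tern = ternarize(weights, delta)
    pos = [w for w, q in zip(weights, tern) if q == 1]
    neg = [w for w, q in zip(weights, tern) if q == -1]
    w_p = sum(pos) / len(pos) if pos else 0.0
    w_n = -sum(neg) / len(neg) if neg else 0.0
    return [w_p if q == 1 else (-w_n if q == -1 else 0.0) for q in tern]

layer = [0.9, -0.4, 0.02, -0.7, 0.5]
print(ttq_quantize(layer))
```

Each quantized layer then needs only a 2-bit code per weight plus two scalars, which is the source of the compression discussed later.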
Fig. 4. An example of how the TTQ algorithm works. Firstly, the normalized full-precision weights and biases are quantized to {-1, 0, +1} by the given per-layer threshold. Secondly, the positive and negative quantization factors are used to scale the quantized weights. Finally, the calculated gradients are back-propagated to each layer. The right part in the dotted rectangle represents the inference stage.

The workflow of TTQ is illustrated in Fig. 4: the normalized full-precision weights are quantized by Δ_l, w_l^p and w_l^n with full-precision activations. Instead of using the optimized threshold of TWN, TTQ adopts a heuristic method to calculate Δ_l:

$$\Delta_l = t \times \max(|\theta_l|), \quad (5)$$

where t is a constant factor determined by experience and Δ = {Δ_1, Δ_2, ..., Δ_l, ...}.

III. PROPOSED ALGORITHM
In this section, we first propose federated trained ternary quantization (FTTQ for short) to reduce the energy consumption of each client during inference as well as the upstream and downstream communications. Subsequently, a ternary federated averaging protocol (T-FedAvg for short) is suggested.
A. Federated Trained Ternary Quantization
Since no direct data exchange is usually allowed between clients in a federated learning system, weight divergence [14] may differ among clients. For example, in the l-th layer of the global model shared by client C1 and client C2, if max(|θ_l^{C1}|) = 5 and max(|θ_l^{C2}|) = 50, it is not necessarily true that the global model will be biased towards C2 if we use the same factor for the two models. To address this issue, we start by scaling the weights to [-1, 1]:

$$\theta^s = g(\theta), \quad (6)$$

where g is a scaling function, $\mathbb{R}^n \to [-1, 1]$. However, magnitude imbalance [38] may be introduced when scaling the entire θ of a network at once, resulting in a significant loss of precision, since most of the elements are pushed to zero. Therefore, we scale the weights layer by layer.

Then, using the same strategy as TTQ, we calculate the quantization threshold Δ according to the scaled weights as follows:

$$\Delta = T_k \times \max(|\theta^s|), \quad (7)$$

where T_k is a hyperparameter with a default setting on client k, and Δ = {Δ_1, Δ_2, ..., Δ_l, ...}. However, according to (6) and (7), we can easily find that the thresholds of all layers are mostly the same, since the maximum absolute value of the scaled θ^s is 1 in most layers. Thus, the model capability may be affected by the homogeneity of the threshold. To avoid this issue, we propose an alternative threshold calculation criterion:

$$\Delta = \frac{T_k}{m}\sum_{i=1}^{m} |\theta_i^s|, \quad (8)$$

where m is the number of neurons and Δ is calculated layer-wise. Obviously, the threshold obtained by (8) is influenced by the layer sparsity and can be seen as an extension of (7):

$$\Delta = \frac{T_k}{m}\sum_{i=1}^{m} |\theta_i^s| \le \frac{T_k}{m}\left(m \times \max|\theta^s|\right) \le \frac{T_k}{m}(m \times 1) \le T_k. \quad (9)$$

Notably, the threshold turns into the optimal solution proposed in [32] if we set the value of T_k to 0.7.
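The layer-wise scaling (6) and the mean-based threshold (8), combined with the mask, sign and scale steps (10)-(12) defined next, can be sketched as follows (plain Python; taking g to be division by the layer's maximum absolute weight, and the toy values, are assumptions for illustration):

```python
def scale_layer(theta):
    """g in (6): map a layer's weights into [-1, 1] via its max magnitude."""
    m = max(abs(w) for w in theta)
    return [w / m for w in theta] if m > 0 else list(theta)

def fttq_layer(theta, w_q, t_k=0.05):
    """One FTTQ layer: scale (6), mean threshold (8), mask/sign/scale (10)-(12)."""
    theta_s = scale_layer(theta)
    delta = t_k * sum(abs(w) for w in theta_s) / len(theta_s)        # rule (8)
    mask = [1 if abs(w) > delta else 0 for w in theta_s]             # (10)
    i_t = [m * (1 if w > 0 else -1) for w, m in zip(theta_s, mask)]  # (11)
    return [w_q * q for q in i_t]                                    # (12)

layer = [5.0, -2.0, 0.1, -4.0, 3.0]
print(fttq_layer(layer, w_q=0.6))
```

Here w_q is a single scalar per layer, matching the single trained quantization factor used by FTTQ; in the actual algorithm it is updated by backpropagation rather than fixed.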
The calculation method of Δ is generally adjusted according to the performance.

Subsequently, several operations are taken to achieve layer-wise weight quantization to overcome the computational burden and reduce the communication costs:

$$mask = \varepsilon(|\theta^s| - \Delta), \quad (10)$$

$$I^t = \mathrm{sign}(mask \odot \theta^s), \quad (11)$$

$$\theta^t = w^q \times I^t, \quad (12)$$

where ε is the step function, ⊙ is the Hadamard product, w^q = [w_1^q, w_2^q, ..., w_l^q, ...] is an independent quantization vector which is trained together with the other parameters layer by layer, and I^t is the quantized ternary weights. Consequently, the mask matrix can be rewritten as the union of a positive index I^p and a negative index I^n of the local model:

$$I^p = \{i \mid \theta_i > \Delta\}, \quad (13)$$

$$I^n = \{i \mid \theta_i < -\Delta\}. \quad (14)$$

Different from standard TTQ, we adopt one quantization factor per layer, which is updated with its gradients together with the other parameters, instead of the previous two quantization factors in each layer, mainly for the following reasons.
1) Stability:
Large weight divergence will be encountered after synchronization if the participating clients are initialized with different parameters, which leads to performance degeneration [14]. Hence, the weight divergence should be minimized at each layer in the federated learning environment.
2) Energy Consumption:
We present a proposition about the convergence trends of w^p and w^n and its proof in Section IV, followed by experimental results on the two factors with different initial values when training MLP and ResNet in Appendix A. It is worth noting that the trends of the positive and negative quantization factors of the TTQ algorithm are almost the same in all layers. Hence, the energy consumption of calculating the gradients of two quantization factors during back propagation can be cut in half if only one quantization factor is retained, which is important for resource-constrained clients.

After quantifying the whole network, the loss function can be calculated and the errors backpropagated in the same way as for continuous weights, except that the weights are ±w^q or zero. The gradients of w^q and the latent full-precision model are calculated according to the rules in [37]. The new update rule is summarized in Algorithm 1. Consequently, FTTQ significantly reduces the size of the updates transmitted to the server, thus reducing the costs of upstream communications. However, the costs of the downstream communications will not be reduced if no additional measures are taken, since the weights of the global model cannot be decomposed into the coefficient and ternary matrix after aggregation. To address this issue, a ternary federated learning protocol is presented in the next section.

Algorithm 1:
Federated Trained Ternary Quantization (FTTQ)

Input: full-precision parameters θ and quantization vector w^q, loss function l, dataset D with sample pairs (x_i, y_i), i = {1, 2, ..., |D|}, learning rate η.
Output: quantized model θ^t.
Init: all client parameters are initialized with θ.
for (x_i, y_i) ∈ D do
    θ^s ← g(θ)
    mask ← ε(|θ^s| − Δ)
    I^t ← sign(mask ⊙ θ^s)
    θ^t ← w^q × I^t
    J ← l(x_i, y_i; θ^t)
    ∂J/∂w^q ← Σ_{i∈I^p} ∂J/∂θ_i^t
    w^q ← w^q + η ∂J/∂w^q
    θ^t ← θ^t + η ∂J/∂θ^t
    update θ
end
Return θ^t (including w^q, I^t)

B. Ternary Federated Averaging
The two-step scheme of the proposed ternary federated averaging protocol with private data is elaborated in Fig. 5. In general, the participating clients quantize the normalized local models and upload the thresholds, quantization factors and ternary models to the server. The server then aggregates all local models to obtain the global model. Finally, the server quantifies the normalized global model again using fixed thresholds and pushes the quantized global model to all clients. The basic flow is described as follows.
1) Upstream:
Let K = {1, 2, ..., |λN|} be the set of indices of the randomly selected clients that participate in the aggregation, where λ is the participation ratio and N is the total number of clients in the federated learning system. The local scaled full-precision and quantized weights of client k ∈ K are denoted by θ_k^s and θ_k^t, respectively. We upload the trained θ_k^t (w^q and I^t) to the server instead of the updates ∇θ_k after the local iterations, although the two are equivalent [11]. At the inference stage, only the quantized model is needed for prediction.
2) Downstream:
After r communication rounds, the server rebuilds all models received from the participating clients, and the global model can be calculated by

$$\theta_{r+1} \leftarrow \frac{\sum_{k=1}^{\lambda N} |D_k|\, \theta_k^t}{\sum_{k=1}^{\lambda N} |D_k|}.$$

Then the server quantifies the global model again with a constant threshold Δ (default setting 0.05) and pushes the quantized model to the clients.

Algorithm 2:
Ternary Federated Averaging

Input: initial global model parameters θ.
Init: broadcast θ to clients k, k = {1, 2, 3, ..., N}, and assign each client a unique dataset D_k.
for round r = 1 to T do
    for c ∈ K = {1, 2, ..., |λN|} in parallel do
        Client k does:
            download quantized θ_{r−1}^t
            initialize w^q
            θ_{k,r}^t ← FTTQ(θ_{r−1}^t, w^q)
            upload θ_{k,r}^t to the server
    end
    Server does:
        θ_r ← Σ_{k=1}^{λN} |D_k| θ_{c,r}^t / Σ_{k=1}^{λN} |D_k|
        mask ← ε(|θ_r| − Δ)
        θ_r^t ← sign(mask ⊙ θ_r)
        broadcast θ_r^t to all clients
end

Unlike standard federated learning algorithms, our method compresses communications during both the upload and download phases, which brings major advantages when deploying DNNs at the inference stage on resource-constrained devices. Specifically, the clients move the local networks from 32-bit to 2-bit and push the 2-bit networks and quantization parameters to the server, and then download the quantized global model from the server at the end. For example, if we configure a federated learning environment involving 20 clients and a global model that requires 25 MB of storage space, the total communication costs of standard federated learning are about 1 GB per round (upload and download). By contrast, our method reduces the costs to 65 MB per round (upload and download), which is
about 1/16 of the cost of the standard method. Note that quantifying the global model pushed to the clients makes reverse engineering more difficult. The overall workflow of the proposed ternary federated protocol is summarized in Algorithm 2.

Fig. 5. The diagram of the proposed T-FedAvg. The blue part runs on the clients with normalized full-precision weights, which is similar to the standard federated learning framework; then the quantization factors, thresholds and ternary local models are pushed to the server, as shown in the orange part; after that, the global model is obtained by server aggregation and normalization; finally, the global model is quantized and pushed back to all clients.

IV. THEORETICAL ANALYSIS
In this section, we first formally demonstrate the properties of the two quantization factors w^p, w^n in TTQ, followed by a proof of the unbiasedness of FTTQ and T-FedAvg. By default in this paper, subscripts represent the indices of the elements in a network instead of the indices of the clients in the federated learning system.

A. The Convergence of the Quantization Factors in TTQ
The experimental results on the convergence profiles of w^p and w^n for two widely used neural networks are presented in Appendix A, showing that the two factors converge to the same value. To prove the convergence theoretically, we first introduce the following assumption.

Assumption 4.1:
The elements of the scaled full-precision θ are uniformly distributed between -1 and 1:

$$\forall \theta_i \in \theta, \quad \theta_i \sim U(-1, 1). \quad (15)$$

Then we have the following proposition.

Proposition 4.1:
Given a one-layer online gradient system whose parameters are elementwise initialized with a symmetric distribution centered at 0, e.g., θ_i ∼ U(−1, 1), and quantized by TTQ with two iteratively updated factors w^p, w^n and a fixed threshold Δ, we have:

$$\lim_{e \to +\infty} w^p = \lim_{e \to +\infty} w^n, \quad (16)$$

where e is the training epoch and w^p, w^n, Δ > 0.

Proof 4.1:
The converged w^{p*} and w^{n*} can be regarded as the optimal solutions for the quantization factors, which minimize the Euclidean distance between the full-precision weights θ and the quantized weights θ^t, equal to w^p I^p − w^n I^n. Then we have:

$$w^{p*}, w^{n*} = \arg\min_{w^p, w^n} \|\theta - w^p I^p + w^n I^n\|_2^2, \quad (17)$$

where I^p = {i | θ_i ≥ Δ}, I^n = {j | θ_j ≤ −Δ} and I^z = {k | |θ_k| < Δ}, and according to (4) we have

$$\theta - w^p I^p + w^n I^n = \begin{cases} \theta_i - w^p, & i \in I^p \\ \theta_k, & k \in I^z \\ \theta_j + w^n, & j \in I^n. \end{cases} \quad (18)$$

Then the original problem can be transformed to

$$\begin{aligned} \|\theta - w^p I^p + w^n I^n\|_2^2 &= \sum_{i \in I^p} (\theta_i - w^p)^2 + \sum_{j \in I^n} (\theta_j + w^n)^2 + \sum_{k \in I^z} \theta_k^2 \\ &= |I^p| (w^p)^2 + |I^n| (w^n)^2 - 2 w^p \sum_{i \in I^p} \theta_i + 2 w^n \sum_{j \in I^n} \theta_j + C, \end{aligned} \quad (19)$$

where $C = \sum_{i \in I^p} \theta_i^2 + \sum_{j \in I^n} \theta_j^2 + \sum_{k \in I^z} \theta_k^2$ is a constant independent of w^p and w^n. Hence the optimal solution of (19) is obtained when

$$w^{p*} = \frac{1}{|I^p|} \sum_{i \in I^p} \theta_i, \qquad w^{n*} = -\frac{1}{|I^n|} \sum_{j \in I^n} \theta_j. \quad (20)$$

Since the weights are distributed symmetrically, w^{p*} and w^{n*} will converge to the same value. This completes the proof.

B. The Unbiasedness of FTTQ
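Before the formal argument, both the closed-form solution (20) of Proposition 4.1 and the unbiasedness established in this subsection can be checked numerically; a minimal Monte-Carlo sketch is given below (pure Python, weights uniform as in Assumption 4.1; the sample size and threshold are arbitrary choices):

```python
import random

random.seed(0)
DELTA = 0.3
theta = [random.uniform(-1.0, 1.0) for _ in range(200_000)]

# Closed-form optima (20): per-side means of the retained weights.
pos = [t for t in theta if t >= DELTA]
neg = [t for t in theta if t <= -DELTA]
w_p = sum(pos) / len(pos)
w_n = -sum(neg) / len(neg)

# Symmetric weights => the two factors coincide (Proposition 4.1).
print(abs(w_p - w_n))          # close to 0

# FTTQ with a single factor w_q = w_p: its mean matches E(theta) = 0.
quantized = [w_p * (1 if t >= DELTA else (-1 if t <= -DELTA else 0))
             for t in theta]
print(sum(quantized) / len(quantized))   # close to 0
```

For U(−0.3, disregarding sign) both side means estimate (1 + Δ)/2 = 0.65, matching (23), and the mean of the quantized weights estimates E(θ) = 0, matching (28).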
Here, we first prove the unbiasedness of FTTQ. To simplify the original problem, we adopt Assumption 4.1, which is common in network initialization, and prove the following proposition.
Proposition 4.2:
Let θ be the local scaled network parameters defined in Assumption 4.1 for one client in a given federated learning system. If θ is quantized by the FTTQ algorithm, then we have

$$\mathbb{E}[FTTQ(\theta)] = \mathbb{E}(\theta). \quad (21)$$

Proof 4.2:
According to (20), w^{q*} is calculated from the elements in I^p = {k | θ_k ≥ Δ}, where Δ is a fixed number once the parameters are generated under Assumption 4.1; hence the elements indexed by I^p obey a new uniform distribution between Δ and 1:

$$\forall k \in I^p, \quad \theta_k \sim U(\Delta, 1), \quad (22)$$

and therefore the probability density function f of θ_k (k ∈ I^p) is f(x) = 1/(1 − Δ). According to Proposition 4.1 and (20), we have:

$$\mathbb{E}(w^{q*}) = \mathbb{E}\Big[\frac{1}{|I^p|}\sum_{k \in I^p} \theta_k\Big] = \frac{1}{|I^p|}\, |I^p| \int_\Delta^1 \theta_i f(\theta_i)\, d\theta_i = \int_\Delta^1 \theta_i f(\theta_i)\, d\theta_i = \frac{1 + \Delta}{2}, \quad (23)$$

where θ_i is an arbitrary element of I^p and |I^p| represents the number of elements in I^p. We know that

$$\mathbb{E}[FTTQ(\theta)] = \mathbb{E}[w^{q*} \times \mathrm{sign}(mask(\theta) \times \theta)] = \mathbb{E}(w^{q*})\, \mathbb{E}[\mathrm{sign}(\theta)\, mask(\theta)], \quad (24)$$

and since

$$\mathbb{E}[\mathrm{sign}(\theta)\, mask(\theta)] = \frac{1 - \Delta}{2} \times 1 + \Delta \times 0 + \frac{1 - \Delta}{2} \times (-1) = 0, \quad (25)$$

hence

$$\mathbb{E}[FTTQ(\theta)] = \frac{1 + \Delta}{2} \times 0 = 0, \quad (26)$$

and under Assumption 4.1, we have

$$\mathbb{E}(\theta) = \int_{-1}^{1} \frac{\theta}{2}\, d\theta = 0, \quad (27)$$

from which it is immediate that

$$\mathbb{E}[FTTQ(\theta)] = \mathbb{E}(\theta). \quad (28)$$

Hence, the FTTQ quantizer output can be considered an unbiased estimator of the input [39]. We can guarantee the unbiasedness of FTTQ in federated learning systems when the weights are uniformly distributed. Furthermore, since the distribution of the network weights in most layers may be non-uniform due to stochastic errors from the data, the self-learned factor w^q and Δ may reduce the quantization errors, which can be regarded as a non-uniform sampling method.

C. The Properties of T-FedAvg
Here, we adopt the following assumption, widely used in the literature [14], to demonstrate the properties of T-FedAvg.
Assumption 4.2:
When a federated learning system with K clients and one server is established, all clients are initialized with the same global model.

Under this assumption, Zhao et al. [14] proved the following conclusions:

Lemma 4.1:
The weight divergence that leads to performance degeneration after r rounds of synchronization between the clients and the server mainly comes from two parts: the weight divergence after r − 1 rounds of aggregation, i.e., ||θ_{r−1}^f − θ_{r−1}^c|| (the superscript c denotes the centralized setting), and the weight divergence resulting from the Earth mover's distance (EMD) between the data distribution on each client and the actual distribution of the whole data population.

Lemma 4.2:
When all clients are initialized with the same global model, the EMD between the data distribution on each client and the distribution of the whole data population becomes the main cause of the performance degeneration.

Since our method is unbiased and reduces the Euclidean distance between the quantized network and the full-precision network, we can conclude that T-FedAvg can also perform well if the original FedAvg converges to the optimal solution.

V. EXPERIMENTAL RESULTS
This section evaluates the performance of the proposedmethod on widely used benchmark datasets. We set upmultiple controlled experiments to examine the performancecompared with the standard federated learning algorithm interms of the test accuracy and communication costs. In thefollowing, we present the experimental settings and the results.
A. Settings
To evaluate the performance of the proposed network quantization and ternary protocol in federated learning systems, we first conduct experiments with 10 independent physical clients connected by a Local Area Network (LAN). Then, we test the obtained model in a simulated environment with the number of clients varying from 10 to 100.

The physical system consists of four CPU laptops connected wirelessly through the LAN to mimic low-power mobile devices; one laptop acts as the server aggregating the models, and the remaining laptops act as clients participating in the federated training. Each client communicates only with the server, and there is no information exchange between the clients.

For the simulations, we typically use 10 clients, in accordance with the number of classes in the datasets. A detailed description of the configuration is given below.

1) Compared algorithms. In this work, we compare the following algorithms:
• Baseline: the centralized learning algorithm, e.g., the stochastic gradient descent (SGD) method, which means that all data is stored in a single computing center and the model is trained directly on the entire data.
• FedAvg: the canonical federated learning approach presented in [11].
• TTQ: the canonical trained ternary quantization method, in which the configuration is the same as the baseline, i.e., the data is stored in a centralized manner and a single model is trained using all the data.
• T-FedAvg: our proposed quantized federated learning approach.

2) Datasets. We select two representative benchmark datasets that are widely used for classification; no data augmentation is used in any experiment.
• MNIST [34]: it contains 60000 training and 10000 testing gray-scale handwritten digit images with 10 classes, where the dimension of each image is 28 × 28.
• CIFAR10 [35]: it contains 60000 colored images of 10 types of objects from frogs to planes, 50000 for training and 10000 for testing. It is a widely used benchmark dataset from which it is difficult to extract features.

3) Models. To evaluate the performance of the above algorithms, two popular deep learning models are selected: MLP and ResNet∗, which represent tiny and large models, respectively. The detailed settings are as follows:
• MLP: it is mainly used for training on small datasets, e.g., MNIST. The model contains two hidden layers with 30 and 20 neurons, respectively. For centralized and distributed training, the learning rate η is set to the same value, and the ReLU function is selected as the activation function.
• ResNet18∗: it is a simplified version of the widely used ResNet [40], where the number of input and output channels of all convolutional layers is reduced to 64. It is a typical benchmark model for evaluating the performance of algorithms on large datasets.

4) Data distribution. The performance of federated learning is affected by the features of the training data stored on the separated clients. To investigate the impact of different data distributions, several types of data are generated:
• IID data: each client holds an IID subset of the data containing all 10 classes.
• Non-IID data: the union of the samples on all clients is the entire dataset, but the number of classes contained on each client is not equal to the total number of categories in the entire dataset (10 for MNIST and CIFAR10). We use the labels to assign samples of N_c classes to each client, where N_c is the number of classes per client. In the extremely non-IID case, N_c equals 1 for each client, but this case is generally not considered since there is no need to train (e.g., a classifier) if only one class is stored on each client.
• Unbalancedness in data size: typically, the sizes of the datasets on different clients vary a lot. To investigate the influence of unbalanced data sizes in the federated learning environment, we split the entire dataset into several distinct parts.

5) Basic configuration. The basic configuration of the federated learning system in our experiments is set as follows:
• Total number of clients: N = 100.
• Participation ratio per round: λ = 0.1.
• Classes per client: N_c = 10.
• Local batch size: B = 64.
• Local epochs: E = 5.

All experimental settings are summarized in Table I.

TABLE I
MODELS AND HYPERPARAMETERS

Models             MLP       ResNet∗
Dataset            MNIST     CIFAR10
Optimizer          SGD       Adam
Learning rate      0.0001    0.008
Baseline accuracy  92.75%    86.30%
Parameter amount   24330     607050
The learning rate of the centralized and federated learning algorithms is the same and remains constant during training. Note that a small learning rate is set for training MLP to slow down the convergence speed for easy observation.
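The label-based non-IID split described in the settings above can be sketched with a shard-based recipe. The helper below is an assumption for illustration (the paper does not specify its exact splitter): it sorts sample indices by label, cuts them into N × N_c shards, and hands each client N_c shards, so every client sees at most N_c classes.

```python
import numpy as np

def partition_non_iid(labels, num_clients, classes_per_client, rng):
    """Shard-based non-IID split: each client receives `classes_per_client`
    label-sorted shards, so it holds samples from at most that many classes.
    (Illustrative recipe, not the authors' exact splitter.)"""
    order = np.argsort(labels, kind="stable")           # indices grouped by label
    shards = np.array_split(order, num_clients * classes_per_client)
    shard_ids = rng.permutation(len(shards))            # shuffle shard assignment
    return [
        np.concatenate([shards[s] for s in
                        shard_ids[c * classes_per_client:(c + 1) * classes_per_client]])
        for c in range(num_clients)
    ]

rng = np.random.default_rng(0)
labels = np.repeat(np.arange(10), 6000)   # MNIST-like: 10 classes x 6000 samples
parts = partition_non_iid(labels, num_clients=10, classes_per_client=2, rng=rng)
print(len(parts), max(len(np.unique(labels[p])) for p in parts))  # 10 clients, <= 2 classes each
```

Setting `classes_per_client` to the total number of classes recovers an (approximately) IID split, which is how the N_c = 10 configuration above can be read.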
B. Performance on IID Data
In this part, we conduct experiments on IID MNIST and CIFAR10 using the benchmark algorithms with the MLP and ResNet∗ models described above, where the baseline and TTQ are representatives of the centralized approaches.

Specifically, the data used by the centralized methods is stored in one computing center, while the data used by FedAvg and T-FedAvg is stored separately. To explore the best performance of FedAvg and T-FedAvg, the federated learning environment is set with 10 fully participating clients, each holding an IID subset of the data containing all 10 classes.

The results and model weight widths are summarized in Table II. We can see that the test accuracies achieved by the baseline algorithm and TTQ are 92.75% and 92.87%, respectively, when trained on MNIST with MLP, and 86.30% and 85.73%, respectively, on CIFAR10 with ResNet∗. TTQ suffers a slight performance deterioration on CIFAR10 when the model complexity is reduced.

TABLE II
TEST ACCURACIES AND WEIGHT WIDTHS OF DIFFERENT ALGORITHMS WHEN TRAINED ON IID DATA

Methods     MNIST               CIFAR10
            Accuracy  Width     Accuracy  Width
Baseline    92.75%    32 bit    86.30%    32 bit
FedAvg      92.37%    32 bit    85.72%    32 bit
TTQ         92.87%    2 bit     85.73%    2 bit
T-FedAvg    92.75%    2 bit     86.60%    2 bit
When the data distribution is IID, FedAvg achieves accuracies of 92.37% and 85.72% on MNIST and CIFAR10, respectively. However, T-FedAvg, whose model is about 1/16 of the full-precision model size, achieves 92.75% on MNIST and the highest accuracy on CIFAR10, 86.60%. It is worth noting that, as the network deepens, the quantization error declines and the quantized model may even exceed the accuracy of the original model, as evidenced by the performance of T-FedAvg with ResNet∗.

The convergence speeds of the four methods over the local iterations are illustrated in Fig. 6, where the centralized methods, the baseline and TTQ, obtain convergence curves aligned with the federated learning environment by sampling at intervals of λ ∗ N ∗ E. Overall, the convergence speed of our method is the fastest when trained on MNIST, and in the initial phase on CIFAR10 it is slightly slower than FedAvg, which is in line with the performance of TTQ.

[Fig. 6: Test accuracy versus communication rounds of the baseline, TTQ, FedAvg, and T-FedAvg, for MLP on MNIST (left) and ResNet∗ on CIFAR10 (right).]
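The roughly 1/16 model size mentioned above follows directly from the bit widths in Table II (2-bit versus 32-bit weights). A back-of-the-envelope sketch using the parameter counts from Table I, ignoring the per-layer full-precision quantization factors (so 16× is an upper bound on the saving):

```python
# Rough per-round upstream traffic for the models of Table I.
# Full precision sends 32 bits per weight; the ternary protocol sends
# 2 bits per weight (scaling-factor overhead ignored in this estimate).
def upstream_bytes(num_params, bits_per_weight):
    return num_params * bits_per_weight / 8

for name, n in [("MLP", 24330), ("ResNet*", 607050)]:
    full = upstream_bytes(n, 32)
    tern = upstream_bytes(n, 2)
    print(f"{name}: {full / 1e6:.3f} MB -> {tern / 1e6:.3f} MB ({full / tern:.0f}x smaller)")
```

For ResNet∗ this corresponds to roughly 2.4 MB versus 0.15 MB per upload, which is the source of the communication savings reported for T-FedAvg.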