Double Momentum SGD for Federated Learning

A Preprint

An Xu, Heng Huang
Department of Electrical and Computer Engineering, University of Pittsburgh
{an.xu, heng.huang}@pitt.edu

Feb 5, 2021

Abstract
Communication efficiency is crucial in federated learning. Conducting many local training steps in clients to reduce the communication frequency between clients and the server is a common method to address this issue. However, the client drift problem arises, as the non-i.i.d. data distributions in different clients can severely deteriorate the performance of federated learning. In this work, we propose a new SGD variant named DOMO to improve the model performance in federated learning, where double momentum buffers are maintained. One momentum buffer tracks the server update direction, while the other tracks the local update direction. We introduce a novel server momentum fusion technique to coordinate the server and local momentum SGD. We also provide the first theoretical analysis involving both the server and local momentum SGD. Extensive experimental results show a better model performance of DOMO than FedAvg and existing momentum SGD variants in federated learning tasks.
1 Introduction

With deep learning models becoming prevalent but data-hungry, data privacy emerges as an important issue. To preserve data privacy without losing access to massive data, federated learning [15] was proposed. To avoid prohibitively gathering data to a data center, the server only communicates the model weights and their updates with the participating clients. Therefore, the participating clients can keep their data private, locally train their models, and then send the local model updates back to update the server model at the server node, which concludes one training round.

However, training a deep learning model requires many training iterations to converge. Unlike workers in data-center distributed training with large network bandwidth and relatively low communication delay, the clients participating in a collaborative federated learning system can face much more unstable conditions and slower network links due to geo-distribution. Typically, [1] showed that one communication round in federated learning could take about 2 to 3 minutes in practice. To address the communication inefficiency, FedAvg [19] was proposed and is acknowledged as the most basic method in federated learning. In FedAvg, the server randomly selects some clients and sends the server model to them. Each client conducts many local training steps using SGD and local data based on the server model and sends back the updated model to the server. The server then averages the models received from the clients and finishes one round of training. Therefore, the number of communication rounds is greatly reduced because communication is not conducted at every training iteration. Here we also refer to the idea of FedAvg as periodic averaging. When we have the full participation of the clients in each training round, FedAvg reduces to local SGD [26, 31]. Another parallel line of work compresses the communication in federated learning to reduce the volume of the messages [23].
In this paper, we do not consider communication compression.

Although periodic averaging methods such as FedAvg greatly improve the training efficiency in federated learning, a new problem named client drift arises. Because we cannot gather and randomly shuffle the client data as data-center distributed training does, the data distributions of different clients are non-i.i.d.
Therefore, the gradients computed at different clients can be highly skewed. Given that we do many local training steps in each training round, skewed gradients will lead to update directions gradually diverging and over-fitting local data in different clients. This client drift issue can deteriorate the performance of FedAvg drastically [33, 8, 9], especially with a low similarity of the data distribution on different clients and a large number of local training steps.

Many efforts have been made to tackle the critical client drift problem in federated learning, and momentum is one of them. As a method to reduce variance shown in recent work [3] and to smooth the model update direction, momentum SGD has shown its power in training various deep learning models in various tasks [27]. [9] proposed to maintain the momentum for the average local model update in a training round and named it server momentum (or global momentum) SGD. The server momentum SGD method had been proposed in [2] for training speech models and in [29] for distributed training, but [29] did not apply it to federated learning. Vanilla momentum SGD maintains momentum for the gradient in each training step. To distinguish it from the server momentum SGD method, we refer to vanilla momentum SGD as local momentum SGD in federated learning throughout this paper. [9] empirically showed the ability of server momentum SGD to tackle client drift in federated learning, while [29] provided a theoretical analysis of server momentum SGD. However, there has been a lack of understanding of the connection between the server and local momentum SGD in federated learning. Besides, whether we can further improve momentum-based methods in federated learning remains another question.

In this paper, we answer the above questions by proposing double momentum SGD (DOMO). We consider the cross-silo federated learning [12] scenario in this work, where the number of participating clients is comparatively small, so it is reasonable to require the full participation of all the clients in each training round. But each client may possess a large amount of local data, making it prohibitive to compute the full local gradient.
Practical scenarios include collaborative learning of hospitals in health care, financial institutes, etc. In contrast to cross-silo federated learning, cross-device federated learning [12] has an extremely large number of participating clients such that we can only sample a fraction of the clients for training in each round, but each client tends to possess a comparatively small amount of local data. Practical scenarios include collaborative learning with mobile devices. We summarize our contributions as follows.

• We propose a new double momentum SGD (DOMO) method with a novel server momentum fusion technique.
• We provide the first theoretical analysis involving both server and local momentum SGD in non-convex settings and new insights into their connection.
• Extensive deep federated learning experiments show that DOMO can improve the test accuracy by up to 5% compared with the state-of-the-art momentum-based method when training VGG-16 on CIFAR-10.
Table 1: List of basic notations.

Training round (total): r (R)
Local training step (total): p (P)
Client (total): k (K)
Server, local learning rate: α, η
Server momentum fusion constant: β
Server, local momentum constant: µ_s, µ_l
Server, local momentum buffer: m_r, m^(k)_{r,p}
Server, (average) local model: x_r, (x̄_{r,p}), x^(k)_{r,p}
Stochastic gradient: ∇F^(k)(x^(k)_{r,p}, ξ^(k)_{r,p})
Full gradient: ∇f^(k)(x^(k)_{r,p})

To begin with, consider federated learning as the optimization problem

    min_{x ∈ R^d} (1/K) Σ_{k=0}^{K−1} f^(k)(x),    (1)

where f^(k) is the local loss function on client k, x is the model weights, and K is the number of clients. All the basic notations throughout this paper are listed in Table 1. In FedAvg, client k trains the local model for P steps using SGD with gradient ∇F^(k)(x^(k)_{r,p}, ξ^(k)_{r,p}) and sends the local model update x^(k)_{r,0} − x^(k)_{r,P} to the server. The server then takes an average and updates the server model via x_{r+1} = x_r − (α/K) Σ_{k=0}^{K−1} (x^(k)_{r,0} − x^(k)_{r,P}).
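As a reference point for the methods discussed below, the FedAvg round just described can be sketched on a toy scalar problem (the quadratic client losses f_k(x) = (x − k)²/2 and all constants here are illustrative assumptions, not the paper's setup):

```python
# Toy FedAvg round: K clients run P local SGD steps from the server model,
# then the server averages the local model updates.
def local_grad(k, x):
    # Gradient of client k's illustrative loss f_k(x) = (x - k)^2 / 2,
    # so each client's non-i.i.d. "data" pulls the model toward k.
    return x - k

def fedavg_round(x_server, K=4, P=10, eta=0.1, alpha=1.0):
    updates = []
    for k in range(K):
        x = x_server                  # client starts from the server model
        for _ in range(P):            # P local SGD steps
            x -= eta * local_grad(k, x)
        updates.append(x_server - x)  # local update x_{r,0} - x_{r,P}
    # Server step: x_{r+1} = x_r - (alpha / K) * sum_k updates_k
    return x_server - alpha * sum(updates) / K

x = 0.0
for _ in range(50):
    x = fedavg_round(x)
# x approaches the average of the client optima, (0 + 1 + 2 + 3) / 4 = 1.5
```

Even in this toy run the client drift is visible: within a round each client moves toward its own optimum k, and only the averaging step pulls the server model back toward the global optimum.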
Algorithm 1 Double Momentum SGD (DOMO).

Input: period P ≥ 1, number of rounds R, number of clients K, server learning rate α, local learning rate η, server momentum constant µ_s, local momentum constant µ_l, server momentum fusion constant β.
Initialize: server momentum buffer m_0 = 0; ∀k ∈ [K], local model x^(k)_{0,0} = x_0 and local momentum buffer m^(k)_{0,0} = 0.
for r = 0, 1, ..., R−1 do
  Client k:
    (r ≥ 1) Receive x_r and (1/K) Σ_{k=0}^{K−1} m^(k)_{r−1,P} to initialize x^(k)_{r,0} and m^(k)_{r,0}. Receive m_r from the server.
    for p = 0, 1, ..., P−1 do
      Option I: x^(k)_{r,p} ← x^(k)_{r,p} − ηβP m_r · 1_{p=0}
      m^(k)_{r,p+1} = µ_l m^(k)_{r,p} + ∇F^(k)(x^(k)_{r,p}, ξ^(k)_{r,p})
      Option I: x^(k)_{r,p+1} = x^(k)_{r,p} − η m^(k)_{r,p+1}
      Option II: x^(k)_{r,p+1} = x^(k)_{r,p} − η m^(k)_{r,p+1} − ηβ m_r
    end for
    Send d^(k)_r = (1/P) Σ_{p=0}^{P−1} m^(k)_{r,p+1} and m^(k)_{r,P} to the server.
  Server:
    Receive d^(k)_r and m^(k)_{r,P} from the clients.
    m_{r+1} = µ_s m_r + (1/K) Σ_{k=0}^{K−1} d^(k)_r
    x_{r+1} = x_r − αηP m_{r+1}
    Send x_{r+1}, (1/K) Σ_{k=0}^{K−1} m^(k)_{r,P}, and m_{r+1} to the clients.
end for
Output:

2 Related Work

Momentum-based.
The state-of-the-art method, server momentum SGD, maintains a server momentum buffer with the local model update (1/K) Σ_{k=0}^{K−1} (x^(k)_{r,0} − x^(k)_{r,P}), which is used to update the server model. In contrast, local momentum SGD maintains a local momentum buffer with ∇F^(k)(x^(k)_{r,p}, ξ^(k)_{r,p}), which is used to update the local model. [29] empirically showed that the server learning rate α = 1 works well in server momentum SGD.

Adaptive. [22] applied the idea of using server statistics as in server momentum SGD to adaptive optimizers including Adam [14], AdaGrad [6], and Yogi [32]. [22] showed that the server learning rate should be smaller than O(1) in terms of complexity, but the exact value was unknown. [22] also showed that convergence analysis with full participation could be easily generalized to partial participation as in cross-device federated learning.

Inter-client Variance Reduction.
Variance reduction in federated learning refers to correcting the client drift caused by different data distributions on different clients, following the variance reduction convention. In contrast, traditional stochastic methods based on variance reduction [11, 4], popular in convex optimization, can be seen as intra-client variance reduction. Scaffold [13] proposed to maintain a control variate c_k on each client k and add (1/K) Σ_{k=0}^{K−1} c_k − c_k to the gradient ∇F^(k)(x^(k)_{r,p}, ξ^(k)_{r,p}) when conducting local training. A prior work, VRL-SGD [18], was built on a similar idea with c_k equal to the average of the local gradients in the last training round. Both Scaffold and VRL-SGD have to maintain local statistics and make the clients stateful, so they are more suitable for cross-silo federated learning. Mime proposed to apply server statistics locally to address this issue and can be seen as a combination of server statistics and variance reduction. However, Mime has to compute the full local gradient, which can be prohibitive in cross-silo federated learning. Besides, Mime's theoretical results are based on Storm [3], but their algorithm is based on standard Polyak's momentum. Though theoretically appealing, the variance reduction technique has been shown to be ineffective in practical neural network optimization [5]. [5] argued that common tricks such as data augmentation, batch normalization, and dropout break the transformation locking and deviate practice from theory.

Other. There are some other settings of federated learning including heterogeneous optimization [17, 28], fairness [20], etc. These different settings, variance reduction techniques, and server statistics can sometimes be combined.
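To make the inter-client control-variate idea concrete, here is a minimal sketch in the VRL-SGD flavor described above, where c_k is approximated by the client's average gradient from the previous round. The toy quadratic losses, constants, and the exact c_k update are illustrative assumptions, not the published algorithms:

```python
# Toy sketch of inter-client variance reduction with control variates.
def grad(k, x):
    return x - k   # client k's gradient for the toy loss f_k(x) = (x - k)^2 / 2

def corrected_round(x_server, c, K=4, P=20, eta=0.05):
    c_avg = sum(c) / K
    new_c, new_x = [], []
    for k in range(K):
        x, acc = x_server, 0.0
        for _ in range(P):
            x -= eta * (grad(k, x) + c_avg - c[k])  # drift-corrected step
            acc += grad(k, x)
        new_c.append(acc / P)   # next round's control variate for client k
        new_x.append(x)
    return sum(new_x) / K, new_c

x, c = 0.0, [0.0] * 4
for _ in range(100):
    x, c = corrected_round(x, c)
# Without the correction the clients drift toward their own optima; with it,
# the averaged model settles at the global optimum 1.5.
```

At the fixed point, c_k equals client k's full gradient at the optimum, so the correction c_avg − c_k cancels the client-specific part of the gradient and all clients take identical steps.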
3 Methodology

We focus on improving momentum-based methods in federated learning and describe our new double momentum SGD (DOMO) algorithm in this section.
Algorithm 2 Double Momentum SGD (DOMO, β = µ_s).

Input: period P ≥ 1, number of rounds R, number of clients K, server learning rate α, local learning rate η, server and local momentum constants µ_s and µ_l.
Initialize: server momentum buffer m_0 = 0; ∀k ∈ [K], local model x^(k)_{0,0} = x_0 and local momentum buffer m^(k)_{0,0} = 0.
for r = 0, 1, ..., R−1 do
  Client k:
    (r ≥ 1) Receive x_r and (1/K) Σ_{k=0}^{K−1} m^(k)_{r−1,P} to initialize x^(k)_{r,0} and m^(k)_{r,0}. Receive m_r from the server.
    for p = 0, 1, ..., P−1 do
      Option I: x^(k)_{r,p} ← x^(k)_{r,p} − ηµ_s P m_r · 1_{p=0}
      m^(k)_{r,p+1} = µ_l m^(k)_{r,p} + ∇F^(k)(x^(k)_{r,p}, ξ^(k)_{r,p})
      Option I: x^(k)_{r,p+1} = x^(k)_{r,p} − η m^(k)_{r,p+1}
      Option II: x^(k)_{r,p+1} = x^(k)_{r,p} − η m^(k)_{r,p+1} − ηµ_s m_r
      Option III: x^(k)_{r,p+1} = x^(k)_{r,p} − η m^(k)_{r,p+1} − ηµ_s P m_r · 1_{p=P−1}  // Server + local momentum approach.
    end for
    Send d^(k)_r = (1/(ηP)) (x^(k)_{r,0} − x^(k)_{r,P}) and m^(k)_{r,P} to the server.
  Server:
    Receive d^(k)_r and m^(k)_{r,P} from the clients.
    m_{r+1} = (1/K) Σ_{k=0}^{K−1} d^(k)_r, x_{r+1} = x_r − αηP m_{r+1}
    Send x_{r+1}, (1/K) Σ_{k=0}^{K−1} m^(k)_{r,P}, and m_{r+1} to the clients.
end for
Output:

3.1 Double Momentum Buffers (DOMO & DOMO-S)
We illustrate the general form of DOMO in Algorithm 1. Different from all existing works, we maintain both the server and local statistics (momentum buffers). However, the local momentum buffer does not make the clients in DOMO stateful: at the end of each training round, we average the local momentum buffers such that the clients start from the same local momentum buffer in the next training round. In the first place, we briefly summarize the idea of DOMO in the following steps.

1. Receive the initial model and statistics from the server at the beginning of the training round.
2. Fuse the server momentum buffer in local training steps.
3. Remove the effect of the server momentum buffer in the local model update before sending it to the server.
4. Aggregate local model updates from the clients to update the model and statistics at the server node.
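The steps above, specialized to option I, can be sketched on a toy scalar problem. This is an illustrative rendering of Algorithm 1 under hypothetical quadratic client losses and constants of our choosing, not the authors' implementation:

```python
# One DOMO round (Algorithm 1, Option I) on a toy scalar model.
def grad(k, x):
    return x - k   # client k's gradient for the toy loss f_k(x) = (x - k)^2 / 2

def domo_round(x_r, m_r, lm, K, P, eta, alpha, mu_s, mu_l, beta):
    d, m_P = [], []
    for k in range(K):
        x, m_loc = x_r, lm            # start from the server model and the
                                      # averaged local momentum buffer
        x -= eta * beta * P * m_r     # server momentum fusion at p = 0 only
        acc = 0.0
        for _ in range(P):
            m_loc = mu_l * m_loc + grad(k, x)
            x -= eta * m_loc          # local momentum SGD step
            acc += m_loc
        d.append(acc / P)             # d_r^k = (1/P) sum_p m_{r,p+1}^k
        m_P.append(m_loc)
    m_next = mu_s * m_r + sum(d) / K  # server momentum buffer update
    x_next = x_r - alpha * eta * P * m_next
    return x_next, m_next, sum(m_P) / K

x, m, lm = 0.0, 0.0, 0.0
for _ in range(400):
    x, m, lm = domo_round(x, m, lm, K=4, P=5, eta=0.01,
                          alpha=0.5, mu_s=0.5, mu_l=0.5, beta=0.5)
```

Note that the quantity sent to the server, d_r^k, is built from the local momentum buffers only, so the fused server momentum does not leak into the aggregated update (step 3 above).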
Server Momentum Fusion in Local Training Steps.
For the corresponding part in Algorithm 1, client k first initializes its local model x^(k)_{r,0}, local momentum buffer m^(k)_{r,0}, and server momentum buffer m_r at the start of training round r in Algorithm 1 line 5. In each local training step, apart from standard local momentum SGD (Algorithm 1 line 8), we propose two options to fuse server momentum into the local training steps (Algorithm 1 lines 7, 9, and 10). For simplicity, we denote option I as DOMO and option II as DOMO-S, with "S" standing for "scatter". In DOMO, we apply the server momentum buffer m_r with coefficient βP and learning rate η to the local model only at the first local training step (p = 0). β is the server momentum fusion constant. In DOMO-S, we evenly scatter this process over all the P local training steps, so the coefficient becomes β instead of βP in Algorithm 1 line 10.

Intuitively, the motivation behind DOMO is that the local model update direction should be adjusted by the direction of the server momentum buffer to tackle the client drift issue, especially when the data distribution across clients is highly skewed. DOMO-S follows this motivation in a more fine-grained way and adjusts each local momentum SGD training step by the server momentum buffer.

Aggregate Local Model Updates without Server Momentum.
We propose to remove the effect of the server momentum m_r in the local model updates (Algorithm 1 line 12) before aggregating them at the server. The remaining part (Algorithm 1 lines 14 to 16) at the server follows standard server momentum SGD. Mathematically speaking, the equivalent server momentum constant would have deviated to µ_s + β if we did not remove it. Besides, the range of β would have been narrowed down to [0, 1 − µ_s). Intuitively speaking, the momentum buffer serves as a smoothed update direction of the stochastic gradient to reduce variance (or sampling noise). Smoothing the smoothed direction itself is not necessary.

To improve the understanding of the connection between server momentum SGD and local momentum SGD, here we propose new concepts called pre-momentum, intra-momentum, and post-momentum. We will show that the naive combination of server momentum and local momentum SGD works like post-momentum, while our proposed DOMO and DOMO-S work like pre-momentum and intra-momentum, respectively.

To illustrate these concepts, we turn Algorithm 1 into Algorithm 2 as an equivalent form when β = µ_s. Options I and II of Algorithm 2 are identical to options I and II of Algorithm 1. This transformation leads to a different form of the update of the server momentum buffer m_r (Algorithm 2 line 16), but it is mathematically equivalent to its update rule in Algorithm 1. In the context of federated learning, we denote server momentum SGD, local momentum SGD, and the naive combination of server and local momentum SGD as FedAvgSM, FedAvgLM, and FedAvgSLM, respectively. Then it is easy to see that FedAvgSLM is identical to option III in Algorithm 2. Specifically, server momentum can be interpreted as post-momentum in FedAvgSLM because the current server momentum buffer m_r is applied at the end of the training round (p = P − 1), after all the local momentum SGD training steps are finished. In comparison, DOMO applies the current server momentum buffer m_r at the beginning of the training round (p = 0), before the local momentum SGD training starts, and is thus regarded as pre-momentum. DOMO-S scatters the effect of the current server momentum buffer m_r and applies it during the local momentum SGD training steps. Therefore, we interpret DOMO-S as intra-momentum.

Consequently, we provide new insight into the connection between server momentum and local momentum SGD by looking at the order of applying the server momentum buffer and the local momentum buffer. Considering that the server and local momentum buffers can be regarded as the smoothed server and local update directions, the order of applying which one first shall not make much difference when the similarity of the data distribution across clients is high, i.e., the client drift issue is not severe. However, when the data similarity is low, it becomes more significant to provide the information of the server update direction during the local training, as pre-momentum and intra-momentum do.

4 Convergence Analysis

In this section, we interpret the motivation behind DOMO from a theoretical perspective. This is the first convergence analysis involving both server and local momentum SGD to the best of our knowledge. There has been little theoretical analysis even for the naive combination of server and local momentum SGD (FedAvgSLM). We consider a non-convex smooth objective function satisfying Assumption 1. We also assume that the local stochastic gradient is an unbiased estimation of the local full gradient and has bounded variance in Assumption 2. Furthermore, we bound the non-i.i.d.
data distribution across clients in Assumption 3, which is looser when B > 0 than the special case with B = 0. We note that for an i.i.d. data distribution as in data-center distributed training, G = 0 and B = 0. For a non-i.i.d. data distribution in federated learning, G measures the data similarity across clients. Specifically, a low data similarity will lead to a larger G. Note that some basic notations are listed in Table 1.

Assumption 1 (L-Lipschitz Smoothness) The global objective function f(·) and local objective functions f^(k)(·) are L-smooth, i.e.,

    ‖∇f^(k)(x) − ∇f^(k)(y)‖ ≤ L‖x − y‖,    (2)
    ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖, ∀x, y ∈ R^d, k ∈ [K].

Assumption 2 (Unbiased Gradient and Bounded Variance) The stochastic gradient ∇F^(k)(x, ξ) is an unbiased estimation of the full gradient ∇f^(k)(x), i.e.,

    E_ξ ∇F^(k)(x, ξ) = ∇f^(k)(x), ∀x ∈ R^d.    (3)

Its variance is also bounded, i.e.,

    E_ξ ‖∇F^(k)(x, ξ) − ∇f^(k)(x)‖² ≤ σ², ∀x ∈ R^d.    (4)

Assumption 3 (Bounded Non-i.i.d. Distribution) For any client k ∈ [K] and x ∈ R^d, there exist B ≥ 0 and G ≥ 0 such that the variance of the local full gradient in each client is upper bounded:

    (1/K) Σ_{k=0}^{K−1} ‖∇f^(k)(x) − ∇f(x)‖² ≤ G² + B²‖∇f(x)‖².    (5)
Theorem 1 (Convergence of DOMO) Suppose Assumptions 1, 2, and 3 hold. Let P ≤ min{(1 − µ_l)/(ηL), (1 − µ_l)/(BηL)} and ηL + µ_l η²L²/(1 − µ_l)² ≤ 1. When α = 1 − µ_s and β = µ_s, we have

    (1/(RP)) Σ_{rP+p=0}^{RP−1} E‖∇f(x̄_{r,p})‖² ≤ 4(1 − µ_l)(f(x_0) − f*)/(3ηRP) + (4ηLσ²/(1 − µ_l)) (1/K + 3ηLP/(1 − µ_l) + 2µ_l ηL/((1 − µ_l)²K)) + 12η²L²P²G²/(1 − µ_l)².    (6)

According to Theorem 1, let η = O(K^{1/2} R^{−1/2} P^{−1/2}) and P = O(K^{−1} R^{1/3}); then we have a convergence rate of (1/(RP)) Σ_{rP+p=0}^{RP−1} E‖∇f(x̄_{r,p})‖² = O(K^{−1/2} R^{−1/2} P^{−1/2}), which achieves a linear speedup regarding the number of clients K and the same iteration complexity as SGD.

Lemma 1 (DOMO update rule) Suppose 0 ≤ r′ ≤ r and 0 ≤ p′ ≤ P. Let

    ŷ_{r,p} = x_0 − (αη/((1 − µ_s)K)) Σ_{k=0}^{K−1} Σ_{r′P+p′=0}^{rP+p−1} m^(k)_{r′,p′+1}  and  z_{r,p} = (1/(1 − µ_l)) ŷ_{r,p} − (µ_l/(1 − µ_l)) ŷ_{r,p−1},    (7)

where ŷ_{0,−1} = ŷ_{0,0} = x_0. Then

    z_{r,p+1} − z_{r,p} = −(αη/((1 − µ_l)(1 − µ_s)K)) Σ_{k=0}^{K−1} ∇F^(k)(x^(k)_{r,p}, ξ^(k)_{r,p}).    (8)

Sketch of Proof.
The key of the proof is to find a novel auxiliary sequence {z_{r,p}} that not only has a more concise update rule than the mixture of server and local momentum, but also stays close to the average local model {x̄_{r,p}}. Moreover, z_{r,P} should equal z_{r+1,0} to facilitate the analysis of the update between z_{r,P−1} and z_{r+1,0}. Lemma 1 gives the update rule of such an auxiliary sequence. In contrast, the analysis between x̄_{r,P−1} and x̄_{r+1,0} is trickier due to the server momentum buffer applied at the end of the training round. To analyze the convergence of {x̄_{r,p}} with the help of {z_{r,p}}, we only have to bound ‖z_{r,p} − x̄_{r,p}‖² (inconsistency bound) and (1/K) Σ_{k=0}^{K−1} ‖x̄_{r,p} − x^(k)_{r,p}‖² (divergence bound). The divergence bound measures how the local models on different clients diverge due to the non-i.i.d. distribution and is more straightforward to analyze, since it is only affected by local momentum SGD. The inconsistency bound measures the inconsistency between the auxiliary variable and the average local model as a trade-off for a more concise update rule.

Lemma 2 (Inconsistency Bound) Let h_p = µ_l/(1 − µ_l) − (1 − α − µ_s)(1 − µ_l^p)/(1 − µ_l), α ≥ (1 − µ_s)(1 − µ_l), and β = µ_s α/(1 − µ_s); then

    Σ_{t=0}^{RP−1} ‖z_{r,p} − x̄_{r,p}‖² ≤ (η²/(1 − µ_l)²) (Σ_{p=0}^{P−1} h_p µ_l^p − µ_l^P)² · Σ_{t=0}^{RP−1} ‖(1/K) Σ_{k=0}^{K−1} ∇F^(k)(x^(k)_{r,p}, ξ^(k)_{r,p})‖².    (9)

Tighten the Inconsistency Bound.
In Lemma 2, we show that the server momentum buffer in DOMO can help tighten the inconsistency bound to improve the bound in Eq. (6). Note that the server momentum buffer does not affect the divergence bound. Specifically, setting α = (1 − µ_s)(1 − µ_l), DOMO can scale the inconsistency bound down to about µ_l/(1 + µ_l) of that in local momentum SGD (α = 1, µ_s = 0). This is reasonable considering that the server momentum buffer carries historical local momentum information. Therefore, we connect the server and local momentum by showing this benefit from the theoretical perspective. The server momentum fusion technique is critical in the proof of Lemma 2. The improvement of the inconsistency bound is a constant factor. It does not affect the overall convergence rate but can accelerate the initial training when the learning rate is large, which is crucial in federated learning. Note that this improvement analysis has not reached the optimal due to inequality scaling.

DOMO with β = µ_s. From Lemma 2, we can see that by setting β = µ_s, DOMO preserves the same inconsistency bound as local momentum SGD. There is a good reason to make this choice and turn Algorithm 1 into Algorithm 2. Consider the momentum buffer as a smoothed update direction. Suppose the update of the server momentum buffer m_{r+1} = µ_s m_r + ∆_r becomes steady with ∆_r → ∆; then m_r becomes an estimation of ∆/(1 − µ_s). With the coefficient 1/(1 − µ_s), the local momentum SGD is inconsistent with server momentum SGD in terms of the magnitude of the update. Setting α = 1 − µ_s balances the inconsistency and leads to β = µ_s in Lemma 2. We note that in practice, α still needs tuning. For reference, [29] showed that the bound in convergence analysis is minimized when α = 1 − µ_s, but α = 1 works best in their experiments. We also note that there is no µ_s in Theorem 1 because it is removed from Lemma 2 by setting β = µ_s.
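A quick numerical check of the steady-state claim above (with illustrative constants of our choosing): if the buffer's per-round increment settles to a constant ∆, then m_r approaches ∆/(1 − µ_s).

```python
# Geometric-series behavior of the server momentum buffer:
# m_{r+1} = mu_s * m_r + delta converges to delta / (1 - mu_s).
mu_s, delta = 0.9, 0.5
m = 0.0
for _ in range(200):
    m = mu_s * m + delta
# m is now essentially delta / (1 - mu_s) = 0.5 / 0.1 = 5.0
```

This 1/(1 − µ_s) amplification is exactly the magnitude mismatch that choosing α = 1 − µ_s compensates for.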
[Figure 1 panels: test accuracy (%) and test loss vs. communication round for VGG-16 with E = 1 at data similarity s = 5%, 10%, 20%. Test accuracy legends (mean±std): s = 5%: FedAvg (58.90±1.24), FedAvgSM (76.36±0.47), FedAvgLM (63.29±0.36), FedAvgLM-Z (63.84±1.29), FedAvgSLM (76.22±0.68), FedAvgSLM-Z (77.04±0.40), DOMO ( ±0.96), DOMO-S (80.25±0.71). s = 10%: FedAvg (63.39±1.64), FedAvgSM (82.10±0.45), FedAvgLM (71.06±0.33), FedAvgLM-Z (71.82±1.20), FedAvgSLM (83.41±0.17), FedAvgSLM-Z (81.78±0.42), DOMO ( ±0.11), DOMO-S (84.82±0.24). s = 20%: FedAvg (74.00±1.36), FedAvgSM (85.88±0.30), FedAvgLM (76.41±0.41), FedAvgLM-Z (79.19±0.64), FedAvgSLM (87.48±0.12), FedAvgSLM-Z (87.20±0.22), DOMO ( ±0.19), DOMO-S (88.23±0.32).]
Figure 1: Training curves using the VGG-16 model with various data similarity s. Best viewed in color.

[Figure 2 panels: test accuracy over combinations of local momentum constant µ_l and server momentum constant µ_s for VGG-16, E = 1, DOMO, at s = 5%, 10%, 20%.]
Figure 2: Test accuracy (%) with various server momentum constants µ_s and local momentum constants µ_l. µ_s = 0 corresponds to FedAvgLM, µ_l = 0 corresponds to FedAvgSM, µ_s = 0 & µ_l = 0 corresponds to FedAvg, and µ_s ≠ 0 & µ_l ≠ 0 corresponds to DOMO. Best viewed in color.

5 Experiments

All experiments are implemented using PyTorch [21] and run on a cluster where each node is equipped with 4 Tesla P40 GPUs and 64 Intel(R) Xeon(R) CPU E5-2683 v4 cores @ 2.10GHz. We compare the following momentum-based methods: 1) FedAvg; 2) FedAvgSM (i.e., server momentum SGD); 3) FedAvgLM (i.e., local momentum SGD); 4) FedAvgLM-Z (i.e., local momentum SGD with the local momentum buffer reset to zero [24] after each training round); 5) FedAvgSLM (i.e., FedAvgSM + FedAvgLM); 6) FedAvgSLM-Z (i.e., FedAvgSM + FedAvgLM-Z); 7) DOMO (i.e., option I); and 8) DOMO-S (i.e., option II). By default, we set α = 1 as suggested in [29] and β = µ_s unless specified otherwise. The local momentum constant µ_l is tuned from {0.9, 0.8, 0.6, 0.4, 0.2}. We tune the server momentum constant µ_s from {0.9, 0.6, 0.3}, which is more coarse-grained because µ_s = 0.9 already works best for all methods in all our experiments. µ_s = 0.9 and µ_l = 0.9 by default unless specified otherwise. The base learning rate is tuned over a grid of values spaced by powers of 10. We test local epoch E ∈ {0.5, 1, 2}, with E = 1 by default.

Data Similarity s. We follow the practice in [13] to simulate the non-i.i.d. data distribution. Specifically, a fraction s of the data is randomly selected and allocated to clients, while the remaining fraction 1 − s is allocated by sorting
[Figure 3 panels: test accuracy (%) vs. communication round for VGG-16, s = 10%. E = 0.5 legend (mean±std): FedAvg (64.50±0.54), FedAvgSM (84.14±0.27), FedAvgLM (71.73±0.88), FedAvgLM-Z (71.26±0.58), FedAvgSLM (85.34±0.36), FedAvgSLM-Z (85.34±0.20), DOMO ( ±0.57), DOMO-S ( ±0.29). E = 2 legend: FedAvg (59.27±0.50), FedAvgSM (79.52±0.61), FedAvgLM (64.77±0.98), FedAvgLM-Z (64.89±0.61), FedAvgSLM (80.60±0.98), FedAvgSLM-Z (81.86±0.29), DOMO ( ±0.23), DOMO-S (82.29±0.62).]

Figure 3: Training curves using the VGG-16 model with data similarity s = 10% and various local epochs E. E = 1 has been shown in the middle plot of Figure 1 and is not repeated here. Best viewed in color.

[Figure 4 panels: test accuracy (%) and test loss vs. communication round for ResNet-56, s = 10%, E = 1. Test accuracy legend (mean±std): FedAvg (61.90±0.78), FedAvgSM (78.98±0.37), FedAvgLM (63.53±0.35), FedAvgLM-Z (63.20±0.80), FedAvgSLM (77.92±0.34), FedAvgSLM-Z (77.90±0.47), DOMO ( ±0.15), DOMO-S (79.95±0.38).]

Figure 4: Training curves using the ResNet-56 model with data similarity s = 10% and local epoch E = 1. Best viewed in color.

according to the label. The data similarity is hence s. We run experiments with data similarity s in {5%, 10%, 20%}. By default, the data similarity is set to 10% and the number of clients K = 16. For all experiments, we report the mean and standard deviation metrics in the form of (mean ± std) over 3 runs with different random seeds for allocating data to clients.

Dataset.
We train the VGG-16 [25] and ResNet-56 [7] models on the CIFAR-10 [16] image classification task. For VGG-16, there is no batch normalization [10] layer. For ResNet-56, we replace the batch normalization layers with group normalization [30] layers because the non-i.i.d. data distribution causes inaccurate batch statistics estimation and worsens the client drift issue. The number of groups in group normalization is set to 8. The local batch size is b = 32 and the total batch size is B = Kb = 512. In this setting, the number of local training steps is P = 98 when the local epoch E = 1. The weight decay is × − . The model is trained for 200 epochs with a learning rate decay of 0.1 at epochs 120 and 160. Random cropping, random flipping, and standardization are applied as data augmentation techniques.

We illustrate the experimental results in Figures 1, 2, 3, and 4, with test accuracy (mean ± std) reported in the brackets of the legends. Testing performance is the main metric for comparison in federated learning because local training metrics become less meaningful with clients tending to overfit their local data during local training. Overall, DOMO ≳ DOMO-S > FedAvgSLM-Z ≳ FedAvgSLM > FedAvgSM > FedAvgLM-Z ≳ FedAvgLM > FedAvg regarding the test accuracy. DOMO and DOMO-S consistently achieve the fastest empirical convergence rate and the best test accuracy in all experiments. On the contrary, the initial convergence rate of FedAvgSLM and FedAvgSLM-Z can even be worse than FedAvgSM. Besides, using server statistics is much better than going without it (FedAvgSM ≫ FedAvg and FedAvgSLM(-Z) ≫ FedAvgLM(-Z)), consistent with the result in [9].
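The non-i.i.d. split described under "Data Similarity s" can be sketched as follows. This is a hypothetical re-implementation from the description; the function and variable names are ours, not [13]'s code:

```python
# Split with data similarity s: a fraction s of samples is spread uniformly
# at random; the remaining 1 - s fraction is sorted by label and dealt out
# to clients in contiguous (label-skewed) shards.
import random

def partition(labels, K, s, seed=0):
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    rng.shuffle(idx)
    n_iid = int(s * len(idx))
    iid, rest = idx[:n_iid], idx[n_iid:]
    rest.sort(key=lambda i: labels[i])      # sort the skewed part by label
    shards = [[] for _ in range(K)]
    for j, i in enumerate(iid):
        shards[j % K].append(i)             # similar part: round-robin
    per = len(rest) // K
    for k in range(K):
        shards[k].extend(rest[k * per:(k + 1) * per])
    return shards

labels = [i % 10 for i in range(1000)]      # toy 10-class label vector
shards = partition(labels, K=4, s=0.1)
```

With s = 0.1 and 10 classes, each client's shard is dominated by 2-3 labels, which reproduces the highly skewed per-client gradients that cause client drift.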
Table 2: Test accuracy with various hyper-parameters α and β. Data similarity s = 10% and local epoch E = 1. α is fixed at 1.0 with various β in the first column, while β is fixed at 0.9 with various α in the second column.

Varying Data Similarity s. We plot the training curves under different data similarity settings in Figure 1. We can see that the improvement of DOMO and DOMO-S over other momentum-based methods increases as the data similarity s decreases. This property makes our proposed method favorable in federated learning, where the data heterogeneity can be complicated. In particular, DOMO improves over FedAvgSLM-Z, FedAvgSM, and FedAvg regarding the test accuracy when s = 5%. When s = 10% and s = 20%, DOMO improves over the best counterpart, while DOMO-S improves by 1.41% and 0.85%, respectively.

Varying the Server and Local Momentum Constants µ_s, µ_l. We explore various combinations of the server and local momentum constants µ_s and µ_l of DOMO and report the test accuracy in Figure 2. µ_s = 0.9 and µ_l = 0.9 work best regardless of the data similarity s and the algorithm we use. Deviating from µ_s = 0.9 and µ_l = 0.9 leads to gradually lower test accuracy.

Varying the Local Epoch E. We plot the training curves of VGG-16 under different local epoch E settings in Figure 3 with data similarity s = 10%. The number of local training steps is P = 49 and 196, respectively, when E = 0.5 and 2. We can see that DOMO improves the test accuracy over the best counterpart when E = 0.5 and 2.

Varying Hyper-parameters α and β. We explore the combinations of hyper-parameters α and β and report the corresponding test accuracy in Table 2 to verify the default choice of α = 1.0 and β = 0.9 in our experiments.
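As a sanity check on the local step counts quoted in the setup (P = 49, 98, 196 for E = 0.5, 1, 2), the arithmetic follows from an even split of CIFAR-10's 50,000 training images over K = 16 clients with batch size b = 32 (the even split is our assumption; the setup only states K, b, and P):

```python
# Steps per round: one local epoch over 50,000 // 16 = 3125 samples with
# batch size 32 takes ceil(3125 / 32) = 98 steps, scaled by E.
import math

def local_steps(E, n_train=50_000, K=16, b=32):
    per_client = n_train // K
    return math.ceil(E * per_client / b)

steps = [local_steps(E) for E in (0.5, 1, 2)]   # P = 49, 98, 196
```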
We also plot the training curves of ResNet-56 in Figure 4, which exhibit a similar pattern. DOMO improves over the best counterpart when data similarity s = 10% and local epoch E = 1. In particular, we find that FedAvgSLM and FedAvgSLM-Z are inferior to FedAvgSM, which implies that a naive combination of server and local momentum SGD may even hurt the performance. In contrast, DOMO and DOMO-S both improve over FedAvgSLM.

Conclusion

In this work, we have presented a new double momentum SGD (DOMO) method with a novel server momentum fusion technique to improve state-of-the-art momentum-based federated learning algorithms. We provided new insights into the connection between the server and local momentum with the new concepts of pre-momentum, intra-momentum, and post-momentum in DOMO. We also provided the first convergence analysis involving both the server and local momentum SGD. From a theoretical perspective, we elaborated on this connection by showing that server momentum can lead to a tighter inconsistency bound in DOMO. Future work may include incorporating inter-client variance reduction techniques to further tighten the divergence bound. Experimental results on deep federated learning tasks verify the effectiveness of DOMO, which achieves an improvement of up to 5% in test accuracy over the state-of-the-art momentum-based method when training VGG-16 on CIFAR-10.
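The data similarity s in our experiments follows the common protocol for simulating non-i.i.d. clients (e.g., in [8]): each client receives a small i.i.d. shard plus a label-sorted shard. The sketch below is an illustration of that protocol under our own naming and signature assumptions, not the exact partitioning code used for the experiments.

```python
import numpy as np

def partition_with_similarity(labels, n_clients, s, seed=0):
    """Split dataset indices among clients so that roughly a fraction s (in [0, 1])
    of the data is assigned i.i.d. and the rest is label-sorted (non-i.i.d.)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels))
    n_iid = int(s * len(labels))
    iid_part, skew_part = idx[:n_iid], idx[n_iid:]
    # Sort the non-i.i.d. portion by label so each client gets few distinct classes.
    skew_part = skew_part[np.argsort(np.asarray(labels)[skew_part])]
    clients = [[] for _ in range(n_clients)]
    for i, chunk in enumerate(np.array_split(iid_part, n_clients)):
        clients[i].extend(chunk.tolist())
    for i, chunk in enumerate(np.array_split(skew_part, n_clients)):
        clients[i].extend(chunk.tolist())
    return clients
```

Smaller s makes each client's label distribution more skewed, which is exactly the regime where the advantage of DOMO and DOMO-S over the baselines grows in Figure 1.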
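To make the double-buffer idea concrete, here is a minimal, simplified sketch of one training round with a local momentum buffer on each client and a slow server momentum buffer over round updates. It omits DOMO's server momentum fusion (the α, β coordination) and partial client participation; all names and default values are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def local_momentum_sgd(w, grads, lr=0.1, mu_l=0.9):
    """One client's local training: plain momentum SGD starting from the server
    model w. grads is a list of stochastic gradients, one per local step."""
    w = w.copy()
    m_l = np.zeros_like(w)  # local momentum buffer, reset at every round
    for g in grads:
        m_l = mu_l * m_l + g
        w = w - lr * m_l
    return w

def server_round(w_server, m_s, client_updates, lr_s=1.0, mu_s=0.9):
    """Server step: average the clients' round updates (w_server - w_client),
    then apply a server momentum buffer m_s tracking the per-round direction."""
    d = np.mean(client_updates, axis=0)  # averaged round update
    m_s = mu_s * m_s + d                 # update the slow server momentum buffer
    w_server = w_server - lr_s * m_s     # slow momentum step
    return w_server, m_s
```

The key design point is that the two buffers live on different timescales: m_l tracks per-step stochastic gradients within a round, while m_s tracks the per-round averaged update direction across rounds.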
References

[1] K. Bonawitz, H. Eichner, W. Grieskamp, D. Huba, A. Ingerman, V. Ivanov, C. Kiddon, J. Konečný, S. Mazzocchi, H. B. McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.
[2] K. Chen and Q. Huo. Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5880–5884. IEEE, 2016.
[3] A. Cutkosky and F. Orabona. Momentum-based variance reduction in non-convex SGD. In Advances in Neural Information Processing Systems, pages 15236–15245, 2019.
[4] A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in Neural Information Processing Systems, 27:1646–1654, 2014.
[5] A. Defazio and L. Bottou. On the ineffectiveness of variance reduced optimization for deep learning. arXiv preprint arXiv:1812.04529, 2018.
[6] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7), 2011.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[8] K. Hsieh, A. Phanishayee, O. Mutlu, and P. Gibbons. The non-IID data quagmire of decentralized machine learning. In International Conference on Machine Learning, pages 4387–4398. PMLR, 2020.
[9] T.-M. H. Hsu, H. Qi, and M. Brown. Measuring the effects of non-identical data distribution for federated visual classification. arXiv preprint arXiv:1909.06335, 2019.
[10] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456. PMLR, 2015.
[11] R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. Advances in Neural Information Processing Systems, 26:315–323, 2013.
[12] P. Kairouz, H. B. McMahan, B. Avent, A. Bellet, M. Bennis, A. N. Bhagoji, K. Bonawitz, Z. Charles, G. Cormode, R. Cummings, et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977, 2019.
[13] S. P. Karimireddy, S. Kale, M. Mohri, S. Reddi, S. Stich, and A. T. Suresh. SCAFFOLD: Stochastic controlled averaging for federated learning. In International Conference on Machine Learning, pages 5132–5143. PMLR, 2020.
[14] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[15] J. Konečný, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
[16] A. Krizhevsky, G. Hinton, et al. Learning multiple layers of features from tiny images. 2009.
[17] T. Li, A. K. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar, and V. Smith. Federated optimization in heterogeneous networks. arXiv preprint arXiv:1812.06127, 2018.
[18] X. Liang, S. Shen, J. Liu, Z. Pan, E. Chen, and Y. Cheng. Variance reduced local SGD with lower communication complexity. arXiv preprint arXiv:1912.12844, 2019.
[19] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.
[20] M. Mohri, G. Sivek, and A. T. Suresh. Agnostic federated learning. In International Conference on Machine Learning, pages 4615–4625, 2019.
[21] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32:8026–8037, 2019.
[22] S. Reddi, Z. Charles, M. Zaheer, Z. Garrett, K. Rush, J. Konečný, S. Kumar, and H. B. McMahan. Adaptive federated optimization. arXiv preprint arXiv:2003.00295, 2020.
[23] D. Rothchild, A. Panda, E. Ullah, N. Ivkin, I. Stoica, V. Braverman, J. Gonzalez, and R. Arora. FetchSGD: Communication-efficient federated learning with sketching. In International Conference on Machine Learning, pages 8253–8265. PMLR, 2020.
[24] F. Seide and A. Agarwal. CNTK: Microsoft's open-source deep-learning toolkit. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 2135–2135, 2016.
[25] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[26] S. U. Stich. Local SGD converges fast and communicates little. In International Conference on Learning Representations, 2018.
[27] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In International Conference on Machine Learning, pages 1139–1147, 2013.
[28] J. Wang, Q. Liu, H. Liang, G. Joshi, and H. V. Poor. Tackling the objective inconsistency problem in heterogeneous federated optimization. arXiv preprint arXiv:2007.07481, 2020.
[29] J. Wang, V. Tantia, N. Ballas, and M. Rabbat. SlowMo: Improving communication-efficient distributed SGD with slow momentum. In International Conference on Learning Representations, 2019.
[30] Y. Wu and K. He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
[31] H. Yu, R. Jin, and S. Yang. On the linear speedup analysis of communication efficient momentum SGD for distributed non-convex optimization. In International Conference on Machine Learning, pages 7184–7193, 2019.
[32] M. Zaheer, S. Reddi, D. Sachan, S. Kale, and S. Kumar. Adaptive methods for nonconvex optimization. In Advances in Neural Information Processing Systems, pages 9793–9803, 2018.
[33] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra. Federated learning with non-IID data. arXiv preprint arXiv:1806.00582, 2018.