Byzantine-Robust Variance-Reduced Federated Learning over Distributed Non-i.i.d. Data
Jie Peng, Zhaoxian Wu, Qing Ling†
School of Data and Computer Science, Sun Yat-Sen University, Guangzhou, Guangdong, China
{pengj95, wuzhx23}@mail2.sysu.edu.cn, [email protected]

Abstract
We propose a Byzantine-robust variance-reduced stochastic gradient descent (SGD) method to solve the distributed finite-sum minimization problem when the data on the workers are not independent and identically distributed (i.i.d.). During the learning process, an unknown number of Byzantine workers may send malicious messages to the master node, leading to remarkable learning error. Most Byzantine-robust methods address this issue by using robust aggregation rules to aggregate the received messages, but rely on the assumption that all the regular workers have i.i.d. data, which is not the case in many federated learning applications. In light of the significance of reducing stochastic gradient noise for mitigating the effect of Byzantine attacks, we use a resampling strategy to reduce the impact of both inner variation (which describes the sample heterogeneity on every regular worker) and outer variation (which describes the sample heterogeneity among the regular workers), along with a stochastic average gradient algorithm (SAGA) to fully eliminate the inner variation. The variance-reduced messages are then aggregated with a robust geometric median operator. Under certain conditions, we prove that the proposed method reaches a neighborhood of the optimal solution at a linear convergence rate, and the learning error is much smaller than those given by the state-of-the-art methods in the non-i.i.d. setting. Numerical experiments corroborate the theoretical results and show satisfactory performance of the proposed method.
Introduction
With the rapid increase of data volume and computing power, the past decades have witnessed the explosive development of machine learning. However, many machine learning methods require a central server or a cloud to collect data from various owners and train models in a centralized manner, leading to serious privacy concerns. To address this issue, federated learning proposes to keep data private at their owners and carry out machine learning tasks locally (Konečný et al. 2016a,b; McMahan et al. 2016; Yang et al. 2019). Every data owner (called worker thereafter) performs local computation based on its local data and sends the results (such as local models, gradients, stochastic gradients, etc.) to the central server. The central server (called master node thereafter) aggregates the results received from the workers and updates the global model.

† Corresponding author.

Nevertheless, the distributed nature of federated learning makes it vulnerable to attacks (Kairouz et al. 2019). Due to the heterogeneity of federated learning systems, not all workers are reliable. During the learning process, some workers might be malfunctioning or even malicious, and send faulty messages to the master node. This paper considers the classical Byzantine attack model (Lamport, Shostak, and Pease 1982), where an unknown number of Byzantine workers are omniscient, collude with each other, and send arbitrary malicious messages. Misled by the Byzantine workers, the aggregation at the master node is problematic, such that the federated learning method converges to an unsatisfactory model or even diverges. For instance, the popular distributed stochastic gradient descent (SGD) method fails in the presence of Byzantine attacks (Chen, Su, and Xu 2017).

Most Byzantine-robust distributed methods modify distributed SGD to handle Byzantine attacks, by replacing mean aggregation with robust aggregation at the master node (Yang, Gang, and Bajwa 2020; Xie, Koyejo, and Gupta 2020). When the data on the workers are independent and identically distributed (i.i.d.) and the local cost functions are in the same form, stochastic gradients computed at the same point are i.i.d. too. Therefore, various robust aggregation rules, such as geometric median (Chen, Su, and Xu 2017), can be applied to alleviate the effect of statistically biased messages sent by Byzantine workers. Unfortunately, the data on the workers are often non-i.i.d. in federated learning applications (Zhao et al. 2018; Li et al. 2020b). Thus, the messages sent by the regular workers are no longer i.i.d., such that handling Byzantine attacks becomes more challenging.

In the Byzantine-robust federated learning setting, this paper considers the distributed finite-sum minimization problem when the workers have non-i.i.d. data. The contributions of this paper are summarized as follows.
C1)
In light of the significance of reducing stochastic gradient noise for mitigating the effect of Byzantine attacks, we propose a Byzantine-robust variance-reduced SGD method, which uses a resampling strategy to reduce the impact of both inner and outer variations, along with a stochastic average gradient algorithm (SAGA) to fully eliminate the inner variation. The variance-reduced messages are then aggregated with a robust geometric median operator.

C2) Under certain conditions, we prove that the proposed method reaches a neighborhood of the optimal solution at a linear convergence rate, and the learning error is much smaller than those given by the state-of-the-art methods in the non-i.i.d. setting.

C3) We conduct numerical experiments on convex and nonconvex learning problems and in i.i.d. and non-i.i.d. settings. The experimental results corroborate the theoretical findings and show satisfactory performance of the proposed method.

Related works
Byzantine-robust distributed machine learning has attracted much attention in recent years. In distributed SGD, the master node aggregates the messages received from the workers by taking their average and uses the mean as the aggregated gradient direction. The mean aggregation, however, is vulnerable to Byzantine attacks. Existing Byzantine-robust distributed methods mostly extend distributed SGD with robust aggregation rules, such as geometric median (Chen, Su, and Xu 2017), coordinate-wise median (Yin et al. 2018), coordinate-wise trimmed mean (Yin et al. 2018), Krum (Blanchard et al. 2017), multi-Krum (Blanchard et al. 2017), and Bulyan (El Mhamdi, Guerraoui, and Rouault 2018), to name a few. Another approach is to detect and discard outliers from the received messages (Rodríguez-Barroso et al. 2020; Azulay et al. 2020; Li et al. 2020a). When the stochastic gradients received from the regular workers satisfy the i.i.d. assumption, the statistically different malicious messages from the Byzantine workers can be detected and discarded, or their negative effect can be alleviated by robust aggregation rules.

Aiming at Byzantine-robustness with distributed non-i.i.d. data, (Li et al. 2019) proposes a robust stochastic aggregation method that employs model aggregation rather than stochastic gradient aggregation. Forced by the introduced consensus constraints, the regular workers and the master node shall reach consensus on their local models, no matter whether the local data are i.i.d. or not. (Dong et al. 2020) also imposes asymptotic consensus between the regular workers and the master node, and proposes a Byzantine-robust proximal stochastic gradient method. (Ghosh et al. 2019) divides the cost functions of the workers into several clusters, such that within each cluster the i.i.d. assumption approximately holds. Then robust aggregation can be applied within each cluster. (He, Karimireddy, and Jaggi 2020) introduces a resampling strategy to reduce the heterogeneity of the received messages in the non-i.i.d. setting.

In both i.i.d. and non-i.i.d. settings, the variance of the messages sent by the regular workers plays a critical role in Byzantine-robustness. Larger variance means that the malicious messages are harder to distinguish. Theoretically, the variance can be classified into the inner variation that describes the sample heterogeneity on every regular worker and the outer variation that describes the sample heterogeneity among the regular workers. In the i.i.d. setting the inner variation often dominates, while in the non-i.i.d. setting the outer variation can be large. (Wu et al. 2020) uses SAGA to correct the stochastic gradients and aggregates the corrected stochastic gradients with geometric median. It has been proven that the impact of the inner variation is fully eliminated from the learning error. (Khanduri et al. 2019) combines stochastic variance-reduced gradient (SVRG) with robust aggregation to solve distributed non-convex problems. (El Mhamdi, Guerraoui, and Rouault 2020) proves that the momentum method can reduce the variance of the stochastic gradients of the regular workers relative to their norm, and is thus helpful to Byzantine-robustness. The resampling strategy used in (He, Karimireddy, and Jaggi 2020) reduces the impact of the inner and outer variations simultaneously.
Problem Formulation and Preliminaries
Consider a distributed network containing one master node and W := R + B workers, among which R workers are regular and B workers are Byzantine. During the learning process, the Byzantine workers are supposed to be omniscient, and can collude to send arbitrary malicious messages to the master node (Lamport, Shostak, and Pease 1982). Denote the sets of regular and Byzantine workers as \mathcal{R} and \mathcal{B}, respectively, with R = |\mathcal{R}| and B = |\mathcal{B}|. Denote \mathcal{W} := \mathcal{R} \cup \mathcal{B} as the set of all workers, with W = |\mathcal{W}|. The goal is to find an optimal solution to the finite-sum minimization problem

x^* = \arg\min_x f(x) := \frac{1}{R} \sum_{w \in \mathcal{R}} f_w(x),   (1)

with

f_w(x) := \frac{1}{J} \sum_{j=1}^{J} f_{w,j}(x).   (2)

Here x ∈ \mathbb{R}^p is the optimization variable, and f_w(x) is the local cost function of regular worker w that averages the costs f_{w,j}(x) of its J samples. The samples are not necessarily i.i.d. across the regular workers. This form of finite-sum minimization problem arises in many federated learning applications (Yang et al. 2019).

Geometric median.
When the Byzantine workers are absent, distributed SGD (Bottou 2010) is a popular method to solve (1). The update of distributed SGD is given by

x^{k+1} = x^k - \gamma \cdot \frac{1}{W} \sum_{w=1}^{W} f'_{w,i_w^k}(x^k),   (3)

where γ is the step size and i_w^k is the sample index chosen by worker w at time k. Upon receiving x^k from the master node, every worker w computes a stochastic gradient f'_{w,i_w^k}(x^k) and returns it to the master node. The master node then averages the received stochastic gradients and updates x^{k+1}. However, distributed SGD is vulnerable to Byzantine attacks. Even if there is only one Byzantine worker, it can replace the true stochastic gradient with a malicious message, such that the average at the master node becomes zero or infinitely large (Yang, Gang, and Bajwa 2020).

To address this issue, most of the Byzantine-robust methods aggregate the received messages with robust aggregation rules, among which we focus on geometric median (Chen, Su, and Xu 2017). To be specific, define g_w^k as the message sent by worker w at time k to the master node, given by

g_w^k = \begin{cases} f'_{w,i_w^k}(x^k), & w \in \mathcal{R}, \\ *, & w \in \mathcal{B}. \end{cases}   (4)

When w is a regular worker, it sends the true stochastic gradient f'_{w,i_w^k}(x^k). Otherwise, when w is a Byzantine worker, it sends an arbitrary message denoted by *. Upon receiving all the messages, the master node calculates the geometric median of {g_w^k, w ∈ \mathcal{W}}, which is

\text{geomed}(\{g_w^k, w \in \mathcal{W}\}) := \arg\min_g \sum_{w=1}^{W} \|g - g_w^k\|.   (5)

Then, the master node updates x^{k+1} by replacing the average in (3) with the geometric median, as

x^{k+1} = x^k - \gamma \cdot \text{geomed}(\{g_w^k, w \in \mathcal{W}\}).   (6)

It has been proved that when the regular workers have i.i.d. data, incorporating distributed SGD and geometric median allows us to tolerate Byzantine attacks when less than half of the workers are malicious, namely, B < W/2 (Chen, Su, and Xu 2017).
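The geometric median in (5) has no closed form; a common way to approximate it is Weiszfeld's iteration. The following minimal Python sketch, which we add here purely for illustration (the function names and the iterative solver are our own choices, not part of the original description), approximates geomed and plugs it into the robust update (6).

```python
import numpy as np

def geometric_median(points, num_iters=100, tol=1e-6):
    """Approximate the geometric median of a set of vectors via Weiszfeld's algorithm."""
    z = np.mean(points, axis=0)               # initialize at the plain average
    for _ in range(num_iters):
        dists = np.linalg.norm(points - z, axis=1)
        dists = np.maximum(dists, 1e-12)      # guard against division by zero
        weights = 1.0 / dists
        z_new = (weights[:, None] * points).sum(axis=0) / weights.sum()
        if np.linalg.norm(z_new - z) < tol:
            break
        z = z_new
    return z

def robust_sgd_step(x, worker_messages, gamma):
    """One update of the form (6): x^{k+1} = x^k - gamma * geomed({g_w^k})."""
    return x - gamma * geometric_median(np.stack(worker_messages))
```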
Byrd-SAGA. Intuitively, when the stochastic gradients sent by the regular workers have large variance, it is difficult to distinguish the malicious messages from the true stochastic gradients. This motivates the combination of variance reduction techniques with the robust aggregation rules. One example is Byrd-SAGA, which combines SAGA with geometric median (Wu et al. 2020). Instead of sending the stochastic gradients to the master node, the regular workers send corrected, variance-reduced stochastic gradients. To be specific, every regular worker stores the most recent stochastic gradient for every one of its samples. When regular worker w selects the sample index i_w^k at time k, the corrected stochastic gradient is

f'_{w,i_w^k}(x^k) - f'_{w,i_w^k}(\phi_{w,i_w^k}^k) + \frac{1}{J} \sum_{j=1}^{J} f'_{w,j}(\phi_{w,j}^k),   (7)

where

\phi_{w,j}^{k+1} = \begin{cases} \phi_{w,j}^k, & j \neq i_w^k, \\ x^k, & j = i_w^k. \end{cases}   (8)

That is, the stochastic gradient f'_{w,i_w^k}(x^k) is corrected by first subtracting the previously stored stochastic gradient f'_{w,i_w^k}(\phi_{w,i_w^k}^k) for sample i_w^k, and then adding the average of all the stored stochastic gradients f'_{w,j}(\phi_{w,j}^k). At time k, denote v_w^k as the corrected stochastic gradient if w is regular and an arbitrary message * if w is Byzantine, given by

v_w^k = \begin{cases} f'_{w,i_w^k}(x^k) - f'_{w,i_w^k}(\phi_{w,i_w^k}^k) + \frac{1}{J} \sum_{j=1}^{J} f'_{w,j}(\phi_{w,j}^k), & w \in \mathcal{R}, \\ *, & w \in \mathcal{B}. \end{cases}   (9)

Then, the update of Byrd-SAGA is

x^{k+1} = x^k - \gamma \cdot \text{geomed}(\{v_w^k, w \in \mathcal{W}\}).   (10)

It has been proven in (Wu et al. 2020) that Byrd-SAGA can eliminate the inner variation, but the effect of the outer variation still exists. In the non-i.i.d. case the outer variation can be very large, such that the learning error of Byrd-SAGA is still considerable in federated learning applications.
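To make the correction (7)-(8) concrete, the sketch below maintains the per-sample gradient table of a single regular worker. The class name and the least-squares sample cost are our own illustrative choices, not part of the paper; only the update logic follows (7)-(8).

```python
import numpy as np

class SAGAWorker:
    """One regular worker: stores the most recent stochastic gradient of each sample
    and returns the corrected gradient in (7), updating the table as in (8)."""
    def __init__(self, A, b, x0):
        self.A, self.b = A, b                     # local data: J samples (least squares)
        self.J = A.shape[0]
        # table entries f'_{w,j}(phi_{w,j}), all initialized at x0
        self.table = np.stack([self.sample_grad(j, x0) for j in range(self.J)])
        self.table_avg = self.table.mean(axis=0)

    def sample_grad(self, j, x):
        # gradient of the j-th sample cost 0.5 * (a_j^T x - b_j)^2
        return (self.A[j] @ x - self.b[j]) * self.A[j]

    def corrected_gradient(self, x):
        j = np.random.randint(self.J)             # sample i_w^k uniformly at random
        g_new = self.sample_grad(j, x)
        v = g_new - self.table[j] + self.table_avg            # correction (7)
        self.table_avg += (g_new - self.table[j]) / self.J    # keep the running average
        self.table[j] = g_new                                  # table update (8)
        return v
```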
Resampling. (He, Karimireddy, and Jaggi 2020) proposes a resampling strategy to reduce the impact of both the inner variation and the outer variation in the non-i.i.d. case, as illustrated in Algorithm 1. Consider the messages {g_w^k, w ∈ \mathcal{W}} defined in (4). The master node takes W rounds of sampling, and samples s messages at every round. When one message has been sampled s times, it is excluded from the pool. At the end of round w, the master node averages the s sampled messages to calculate \tilde g_w^k. We denote this resampling procedure with s-replacement as

\{\tilde g_w^k, w \in \mathcal{W}\} = \text{resampling}(\{g_w^k, w \in \mathcal{W}\}, s).   (11)

Then, (He, Karimireddy, and Jaggi 2020) uses Krum (Blanchard et al. 2017) to aggregate the new messages {\tilde g_w^k, w ∈ \mathcal{W}}. However, the resampling strategy is unable to fully eliminate either the inner variation or the outer variation, and thus the performance of robust aggregation is still limited.

Algorithm 1 Resampling with s-replacement (He, Karimireddy, and Jaggi 2020)
Input: {g_w^k, w ∈ \mathcal{W}}, s
Initialize {c_w = 0, w ∈ \mathcal{W}}
for w = 1, ..., W do
  for ζ = 1, ..., s do
    Select w_ζ ∼ Uniform({w | w ∈ \mathcal{W}, c_w < s})
    c_{w_ζ} = c_{w_ζ} + 1
  Compute average \tilde g_w^k = \frac{1}{s} \sum_{\zeta=1}^{s} g_{w_\zeta}^k
Return {\tilde g_w^k, w ∈ \mathcal{W}}
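A direct Python transcription of Algorithm 1 follows; this sketch is ours (the paper gives only pseudocode), and it assumes the messages are NumPy vectors.

```python
import numpy as np

def resampling(messages, s, rng=None):
    """s-replacement resampling (Algorithm 1): each input message is used at most
    s times, and each output message is the average of s sampled inputs."""
    rng = rng or np.random.default_rng()
    W = len(messages)
    counts = np.zeros(W, dtype=int)
    out = []
    for _ in range(W):
        picked = []
        for _ in range(s):
            candidates = np.flatnonzero(counts < s)   # messages sampled fewer than s times
            idx = rng.choice(candidates)
            counts[idx] += 1
            picked.append(messages[idx])
        out.append(np.mean(picked, axis=0))
    return out
```

Since the total number of draws is W·s and the total capacity of the pool is also W·s, the candidate set is never empty.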
Algorithm Development

As discussed above, Byrd-SAGA can eliminate the inner variation but the effect of the outer variation still exists, while the resampling strategy only reduces, rather than eliminates, the impact of the inner and outer variations. Neither of them is able to adequately address the issue of robust aggregation over distributed non-i.i.d. data, where both variations can be large. Therefore, we propose to use a resampling strategy to reduce the impact of the inner and outer variations, along with SAGA to fully eliminate the inner variation. The variance-reduced messages are then aggregated with a robust geometric median operator.

Our proposed method is described in Algorithm 2. At time k, the master node broadcasts its current variable x^k to all the workers. Upon receiving x^k, every regular worker w ∈ \mathcal{R} uniformly randomly selects a sample i_w^k ∈ {1, ..., J} to obtain the stochastic gradient f'_{w,i_w^k}(x^k), calculates the corrected stochastic gradient v_w^k according to (9), stores f'_{w,i_w^k}(x^k) in the stochastic gradient table, and then sends the corrected stochastic gradient v_w^k to the master node. In contrast, every Byzantine worker w ∈ \mathcal{B} sends an arbitrary malicious message v_w^k to the master node. After receiving the messages {v_w^k, w ∈ \mathcal{W}} from all the workers, the master node generates a set of new messages {\tilde v_w^k, w ∈ \mathcal{W}} using Algorithm 1, given by

\{\tilde v_w^k, w \in \mathcal{W}\} = \text{resampling}(\{v_w^k, w \in \mathcal{W}\}, s).   (12)

Finally, the master node updates the variable as

x^{k+1} = x^k - \gamma \cdot \text{geomed}(\{\tilde v_w^k, w \in \mathcal{W}\}).   (13)

Algorithm 2
Byzantine-Robust Variance-Reduced Federated Learning over Distributed Non-i.i.d. Data
Master node:
Input: x^0 ∈ \mathbb{R}^p, γ, s.
At time k:
  Broadcast the current variable x^k to all workers
  Receive messages {v_w^k, w ∈ \mathcal{W}} from all workers
  Generate {\tilde v_w^k, w ∈ \mathcal{W}} according to Algorithm 1 with parameter s
  Update x^{k+1} = x^k - \gamma \cdot \text{geomed}(\{\tilde v_w^k, w \in \mathcal{W}\})

Regular worker w:
Initialize: {f'_{w,j}(\phi_{w,j}^0) = f'_{w,j}(x^0), j ∈ {1, ..., J}}.
At time k:
  Receive variable x^k from the master node
  Compute \bar g_w^k = \frac{1}{J} \sum_{j=1}^{J} f'_{w,j}(\phi_{w,j}^k)
  Sample i_w^k uniformly at random from {1, ..., J}
  Update v_w^k = f'_{w,i_w^k}(x^k) - f'_{w,i_w^k}(\phi_{w,i_w^k}^k) + \bar g_w^k
  Store stochastic gradient f'_{w,i_w^k}(\phi_{w,i_w^k}^{k+1}) = f'_{w,i_w^k}(x^k)
  Send corrected stochastic gradient v_w^k to the master node
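Putting the pieces together, here is a minimal end-to-end sketch of the master loop of Algorithm 2 on a synthetic problem. It assumes the SAGAWorker, resampling, and geometric_median helpers sketched above; the way Byzantine workers behave (sending large Gaussian noise) is only one illustrative attack, not the paper's specification.

```python
import numpy as np

def run_proposed(workers, byzantine_ids, x0, gamma, s, num_iters):
    """Master loop of Algorithm 2: collect corrected gradients, resample with
    s-replacement, aggregate with the geometric median, and take a step."""
    x = x0.copy()
    rng = np.random.default_rng(0)
    for _ in range(num_iters):
        messages = []
        for w, worker in enumerate(workers):
            if w in byzantine_ids:
                messages.append(rng.normal(size=x.shape) * 100.0)  # arbitrary malicious message
            else:
                messages.append(worker.corrected_gradient(x))
        resampled = resampling(messages, s, rng)
        x = x - gamma * geometric_median(np.stack(resampled))
    return x
```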
Theoretical Analysis

In this section, we theoretically analyze the performance of the proposed method, and investigate the effect of reducing the inner and outer variations on robust aggregation. We start from introducing three assumptions on the sample costs {f_{w,j}, w ∈ \mathcal{R}, j ∈ {1, ..., J}}. We use E to denote the expectation with respect to all random variables i_w^k.

Assumption 1. (Strong convexity and Lipschitz continuous gradients) The function f is µ-strongly convex and has L-Lipschitz continuous gradients. Namely, for any x, y ∈ \mathbb{R}^p, it holds that

f(y) \ge f(x) + \langle f'(x), y - x \rangle + \frac{\mu}{2} \|y - x\|^2,   (14)

and

\|f'(y) - f'(x)\| \le L \|y - x\|.   (15)

Assumption 2. (Bounded outer variation) For any x ∈ \mathbb{R}^p, the variation of the local gradients at the regular workers with respect to the global gradient is upper-bounded by

\frac{1}{R} \sum_{w \in \mathcal{R}} \|f'_w(x) - f'(x)\|^2 \le \delta^2.   (16)

Assumption 3. (Bounded inner variation) For every regular worker w ∈ \mathcal{R} and any x ∈ \mathbb{R}^p, the variation of its stochastic gradients with respect to its local gradient is upper-bounded by

E_{i_w^k} \|f'_{w,i_w^k}(x) - f'_w(x)\|^2 \le \sigma^2, \quad \forall w \in \mathcal{R}.   (17)

Figure 1: An illustration of the impact of inner and outer variations on geometric median under Byzantine attacks. Suppose that 8 regular workers sample stochastic gradients from 4 distributions. When the variations are large, the Byzantine-free true average is far away from the geometric median under Byzantine attacks. In contrast, when the variations are small, the gap is small too.

Assumption 1 is standard for analyzing distributed learning algorithms. Assumptions 2 and 3 bound the outer variation that describes the sample heterogeneity among the regular workers and the inner variation that describes the sample heterogeneity on every regular worker, respectively. When all the samples at the regular workers are identical, both the inner variation and the outer variation are zero. When the distributed data are i.i.d., the inner variation is often larger than the outer variation, while in the non-i.i.d. case the outer variation usually dominates.
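To make Assumptions 2 and 3 concrete, the following sketch (our own addition, with illustrative array conventions) empirically estimates the two variation bounds from per-sample gradients evaluated at a common point x.

```python
import numpy as np

def inner_outer_variation(worker_sample_grads):
    """Estimate the inner variation (sigma^2) and outer variation (delta^2).
    worker_sample_grads[w] is a (J, p) array of per-sample gradients of regular
    worker w, all evaluated at the same point x."""
    local_grads = np.stack([g.mean(axis=0) for g in worker_sample_grads])  # f'_w(x)
    global_grad = local_grads.mean(axis=0)                                  # f'(x)
    # outer variation: average squared deviation of local gradients from the global one
    delta2 = np.mean(np.sum((local_grads - global_grad) ** 2, axis=1))
    # inner variation: worst-case (over workers) average squared deviation of the
    # per-sample gradients from the corresponding local gradient
    sigma2 = max(np.mean(np.sum((g - g.mean(axis=0)) ** 2, axis=1))
                 for g in worker_sample_grads)
    return sigma2, delta2
```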
Concentration property of geometric median.
To understand the importance of variance reduction to robust aggregation, we show the concentration property of geometric median in the following lemma.
Lemma 1.
Let {z_w, w ∈ \mathcal{W}} be a subset of random vectors distributed in a normed vector space. It holds when B < W/2 that

E\|\text{geomed}(\{z_w, w \in \mathcal{W}\}) - \bar z\|^2 \le C_\alpha \frac{1}{R} \sum_{w \in \mathcal{R}} E\|z_w - E z_w\|^2 + C_\alpha \frac{1}{R} \sum_{w \in \mathcal{R}} \|E z_w - \bar z\|^2,   (18)

where \bar z := \frac{1}{R} \sum_{w \in \mathcal{R}} E z_w, \alpha := B/W, and C_\alpha := \frac{2 - 2\alpha}{1 - 2\alpha}.

Suppose z_w = g_w^k as defined in (4), such that z_w is the true stochastic gradient when w is regular, and an arbitrary malicious message when w is Byzantine. The left-hand side of (18) is the mean-square error of the geometric median relative to the average of the true stochastic gradients. The right-hand side of (18) contains two terms. The first term now refers to the overall inner variation, while the second term refers to the outer variation. When either the inner variation or the outer variation is large, the geometric median aggregation yields a poor direction that eventually leads to a large learning error, as illustrated in Figure 1. This fact motivates Byrd-SAGA. It has been shown in (Wu et al. 2020) that SAGA effectively eliminates the inner variation of the messages sent by the regular workers. Therefore, if we replace z_w = v_w^k as defined in (9), the first term at the right-hand side of (18) vanishes. However, in the non-i.i.d. case, the second term at the right-hand side of (18) can be large and still deteriorates the performance of robust aggregation.

Concentration property of geometric median after resampling.
Interestingly, augmented with the resampling strategy, geometric median shows better dependency on both the inner variation and the outer variation, as shown in the following lemma.
Lemma 2.
Let {z_w, w ∈ \mathcal{W}} be a subset of random vectors distributed in a normed vector space, where the random vectors in {z_w, w ∈ \mathcal{R}} are independent. Generate from {z_w, w ∈ \mathcal{W}} a new set {\tilde z_w, w ∈ \mathcal{W}} using the resampling strategy with s-replacement. It holds when B < W/(2s) that

E\|\text{geomed}(\{\tilde z_w, w \in \mathcal{W}\}) - \bar z\|^2 \le \left(d + \frac{1-d}{R}\right) C_{s\alpha} \frac{1}{R} \sum_{w \in \mathcal{R}} E\|z_w - E z_w\|^2 + d C_{s\alpha} \frac{1}{R} \sum_{w \in \mathcal{R}} \|E z_w - \bar z\|^2,   (19)

where \bar z := \frac{1}{R} \sum_{w \in \mathcal{R}} E z_w, \alpha := B/W, C_{s\alpha} := \frac{2 - 2s\alpha}{1 - 2s\alpha}, and d := \frac{W - s}{s(W - 1)}.

Comparing the bounds of the mean-square errors given by (18) and (19), we observe that the coefficients change. When s = 1, the two bounds are identical. When s > 1, the coefficients in (19) are smaller than those in (18) if α is sufficiently small, meaning that the inner variation and the outer variation are simultaneously reduced. The cost, however, is tolerating a smaller fraction of Byzantine workers. We will numerically depict the effect of variance reduction in Figure 2 when discussing the main theorem.

Note that (He, Karimireddy, and Jaggi 2020) analyzes the combination of resampling and Krum. This analysis for the combination of resampling and geometric median is new. Again, suppose z_w = g_w^k as defined in (4). Then, the right-hand side of (19) contains two terms, one for the inner variation and another for the outer variation. As discussed above, the combination of distributed SGD with geometric median and resampling can reduce the dependency on both the inner and outer variations compared to that without resampling, but cannot eliminate the inner variation. We shall see in the main theorem that the further introduction of SAGA in our proposed method is able to eliminate the inner variation, and hence leads to a reduced learning error.

Main theorem.
The following main theorem establishes theconvergence property of the proposed method.
Theorem 1.
Under Assumptions 1 and 2, if the number ofByzantine workers satisfies
B < W/(2s) and the step size satisfies \gamma \le \frac{\mu\sqrt{J}}{L^2 C_{s\alpha}}, then the iterate x^k generated by the proposed method in Algorithm 2 satisfies

E\|x^k - x^*\|^2 \le \left(1 - \frac{\gamma\mu}{2}\right)^k \Delta_1 + \Delta_2,   (20)
where

\Delta_1 := \|x^0 - x^*\|^2 - \Delta_2,   (21)

\Delta_2 := \frac{5 d C_{s\alpha} \delta^2}{\mu^2},   (22)

while \alpha := B/W, C_{s\alpha} := \frac{2 - 2s\alpha}{1 - 2s\alpha}, and d := \frac{W - s}{s(W - 1)}.

Figure 2: Learning error versus α, the fraction of Byzantine workers, for Byrd-SAGA and the proposed method with different s. We omit δ² in the learning errors to compare C_α with dC_{sα}. The number of all workers is set as W = 30.

Theorem 1 shows that the proposed method reaches a neighborhood of the optimal solution at a linear convergence rate. The learning error Δ₂ is in the order of O(dC_{sα}δ²); it is determined by the outer variation, but is irrelevant to the inner variation. In contrast, the learning error of Byrd-SAGA is in the order of O(C_αδ²) when B < W/2. For the combination of distributed SGD and geometric median (which we call Byrd-SGD) and the combination of distributed SGD, resampling and geometric median (which we call RS-Byrd-SGD), the learning errors are in the orders of O(C_ασ² + C_αδ²) and O((d + (1−d)/R)C_{sα}σ² + dC_{sα}δ²) when B < W/2 and B < W/(2s), respectively. Note that the analysis of Byrd-SAGA also needs Assumptions 1 and 2, while those of Byrd-SGD and RS-Byrd-SGD need Assumptions 1, 2 and 3. We list all the learning errors in Table 1.

Algorithm | Requirement | Learning Error
Byrd-SGD | B < W/2 | O(C_α σ² + C_α δ²)
RS-Byrd-SGD | B < W/(2s) | O((d + (1−d)/R) C_{sα} σ² + d C_{sα} δ²)
Byrd-SAGA | B < W/2 | O(C_α δ²)
Our Proposed | B < W/(2s) | O(d C_{sα} δ²)
Table 1: Requirements and learning errors of four algorithms.

Figure 2 shows the dependency of the learning error on α := B/W for Byrd-SAGA and the proposed method with different s, both of which eliminate the impact of the inner variation. We omit the outer variation δ² in the learning errors and compare C_α with dC_{sα}. The number of all workers is set as W = 30, which is consistent with the setting in the numerical experiments. Observe that when α is small enough, the proposed method has a smaller learning error than Byrd-SAGA. This fact depicts the advantage of resampling in reducing the outer variation and handling the non-i.i.d. data. The side effect of resampling is the tolerance to a smaller fraction of Byzantine workers: larger s means tolerance to fewer Byzantine workers. According to the analysis, s = 2 achieves a satisfactory tradeoff between the learning error and the tolerable fraction of Byzantine workers. Therefore, in the numerical experiments we set s = 2, which is also the value recommended by (He, Karimireddy, and Jaggi 2020).

Note that (5) has no closed-form solution, and must be solved with iterative algorithms. Therefore, we often consider the ε-approximate geometric median, allowing the computed vector and the true geometric median to have a gap of ε. For simplicity, in the analysis above we assume ε = 0. In the supplementary material, we also consider the case of ε ≠ 0.
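The comparison behind Figure 2 and Table 1 can be reproduced directly from the constants. The short script below (our own addition, written from the definitions of C_α, C_{sα}, and d as reconstructed above) evaluates C_α for Byrd-SAGA against d·C_{sα} for the proposed method over the admissible range of α with W = 30.

```python
import numpy as np
import matplotlib.pyplot as plt

W = 30  # total number of workers, as in the experiments

def C(a):
    """Geometric-median constant C_a = (2 - 2a) / (1 - 2a), finite for a < 1/2."""
    return (2 - 2 * a) / (1 - 2 * a)

def proposed_coefficient(alpha, s):
    """Learning-error coefficient d * C_{s*alpha} of the proposed method."""
    d = (W - s) / (s * (W - 1))
    return d * C(s * alpha)

alphas = np.linspace(0.0, 0.49, 200)
plt.plot(alphas, [C(a) for a in alphas], label="Byrd-SAGA: C_alpha")
for s in (2, 3, 4):
    valid = alphas[alphas < 0.5 / s]      # tolerable fraction shrinks to alpha < 1/(2s)
    plt.plot(valid, [proposed_coefficient(a, s) for a in valid], label=f"Proposed (s={s})")
plt.xlabel("alpha = B / W")
plt.ylabel("learning-error coefficient")
plt.legend()
plt.show()
```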
Numerical Experiments

This section presents numerical experiments to demonstrate the robustness of our proposed method. We consider both convex and nonconvex distributed learning problems. For the convex problem, we focus on softmax regression on the MNIST dataset. For the nonconvex problem, we train two-layer perceptrons, in which each layer has 50 neurons and the activation function is 'tanh', on the MNIST and COVTYPE datasets. The attributes of the datasets are described in Table 2. In the i.i.d. case, we use the MNIST dataset, launch 1 master node and W = 30 workers, and let the data be evenly distributed across all workers. In the non-i.i.d. case, we first use the MNIST dataset, launch 1 master node and W = 30 workers, and let every three workers evenly share the data from one class. We then use the COVTYPE dataset, launch 1 master node and W = 21 workers, and also let every three workers evenly share the data from one class. The numerical experiments are conducted on a server with two Intel(R) Xeon(R) Silver 4216 CPUs and four GeForce RTX 2080 GPUs. The MNIST dataset is available at http://yann.lecun.com/exdb/mnist/ and the COVTYPE dataset at https://archive.ics.uci.edu/ml/datasets/covertype.

Name | Train | Test | Dimensions | Classes
MNIST | 60000 | 10000 | 784 | 10
COVTYPE | 11340 | 565892 | 54 | 7
Table 2: Attributes of the MNIST and COVTYPE datasets.
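To make the non-i.i.d. split described above concrete, the following sketch (our own; variable names are illustrative) partitions a labeled dataset so that every three workers evenly share the samples of one class.

```python
import numpy as np

def partition_by_class(labels, workers_per_class=3, rng=None):
    """Return a dict worker_id -> sample indices, with every `workers_per_class`
    workers evenly sharing the samples of one class (non-i.i.d. split)."""
    rng = rng or np.random.default_rng(0)
    classes = np.unique(labels)
    partition = {}
    for c_idx, c in enumerate(classes):
        idx = rng.permutation(np.flatnonzero(labels == c))
        chunks = np.array_split(idx, workers_per_class)
        for j, chunk in enumerate(chunks):
            partition[c_idx * workers_per_class + j] = chunk
    return partition

# example: 10 classes x 3 workers per class = W = 30 workers on MNIST
# partition = partition_by_class(mnist_train_labels)   # labels array is hypothetical
```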
Benchmark methods

We compare our proposed method with several benchmarks.
Distributed SGD.
The distributed SGD aggregates the received messages by returning the mean, and hence has no robustness against Byzantine attacks.
Byrd-SGD.
Byrd-SGD aggregates the received messages by returning the geometric median, as shown in (6).
RS-Byrd-SGD.
RS-Byrd-SGD first resamples the received messages with s-replacement, and then aggregates the results by returning the geometric median.

Krum.
Krum aggregates the received messages by returning the one that has the smallest sum of squared distances to its W − B − 2 nearest neighbors, given by

\text{Krum}(\{z_w, w \in \mathcal{W}\}) = z_{w^*}, \quad w^* = \arg\min_{w \in \mathcal{W}} \sum_{w \to w'} \|z_w - z_{w'}\|^2.

Here w → w' selects the indexes w' of the W − B − 2 nearest neighbors of z_w in {z_w, w ∈ \mathcal{W}}. Note that Krum needs to know the number of Byzantine workers in advance.
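A minimal sketch of the Krum rule described above (our own implementation of the published rule; the helper name is ours):

```python
import numpy as np

def krum(messages, num_byzantine):
    """Krum: return the message with the smallest sum of squared distances
    to its (W - num_byzantine - 2) nearest neighbors."""
    Z = np.stack(messages)
    W = Z.shape[0]
    k = W - num_byzantine - 2                      # number of neighbors considered
    dists = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2) ** 2
    scores = []
    for w in range(W):
        d = np.delete(dists[w], w)                 # distances to the other messages
        scores.append(np.sort(d)[:k].sum())        # k closest neighbors
    return Z[int(np.argmin(scores))]
```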
Byrd-SAGA.

Byrd-SAGA aggregates the received messages by returning the geometric median, but the regular workers return the corrected stochastic gradients instead of the plain stochastic gradients used in the above benchmark methods.
RSA.
RSA is based on model aggregation, rather than the stochastic gradient aggregation used in the above benchmark methods. The master node and every worker w maintain iterates x_0^k and x_w^k, respectively. The update rule of the master node is

x_0^{k+1} = x_0^k - \gamma \cdot \lambda \sum_{w \in \mathcal{W}} \text{sign}(x_0^k - x_w^k),

while the update rule of regular worker w ∈ \mathcal{R} is

x_w^{k+1} = x_w^k - \gamma \cdot \left( f'_{w,i_w^k}(x_w^k) + \lambda \, \text{sign}(x_w^k - x_0^k) \right).

Here sign is the element-wise sign function and λ > 0 is the penalty parameter.

In the numerical experiments, the batch size is set as 32. When using resampling, we set s = 2. All parameters in the benchmark methods are hand-tuned to the best.
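The RSA updates above can be sketched as follows. This simplified sketch is ours: the callable worker_grads and the way Byzantine workers misbehave (sending random local models) are illustrative assumptions, not the original specification.

```python
import numpy as np

def rsa_round(x0, x_workers, worker_grads, byzantine_ids, gamma, lam, rng):
    """One RSA round: the master penalizes disagreement via sign terms, and each
    regular worker takes a stochastic gradient step plus the same penalty."""
    # master update: x0 <- x0 - gamma * lambda * sum_w sign(x0 - x_w)
    penalty = sum(np.sign(x0 - xw) for xw in x_workers)
    x0_new = x0 - gamma * lam * penalty
    new_workers = []
    for w, xw in enumerate(x_workers):
        if w in byzantine_ids:
            new_workers.append(rng.normal(size=x0.shape))   # malicious local model
        else:
            g = worker_grads[w](xw)                         # stochastic gradient at x_w
            new_workers.append(xw - gamma * (g + lam * np.sign(xw - x0)))
    return x0_new, new_workers
```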
Byzantine attacks

We consider the following Byzantine attacks.
Sign-flipping attacks.
Every Byzantine worker computes its true message, multiplies it by a constant c < 0, and sends the result to the master node. Here we set c = −….

Gaussian attacks.
Every Byzantine worker sends messages, whose elements follow the Gaussian distribution N(0, …), to the master node.

Sample-duplicating attacks.
The Byzantine workers collude to choose a specific regular worker, and duplicate its message at every time. This amounts to the Byzantine workers duplicating the data samples of the chosen regular worker. We only apply the sample-duplicating attacks in the non-i.i.d. case.
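For completeness, the three attacks can be generated as follows; this sketch is ours, and the default constants (c, the noise scale) are illustrative placeholders rather than the exact values used in the experiments.

```python
import numpy as np

def sign_flipping(true_message, c=-1.0):
    """Sign-flipping attack: scale the worker's true message by a negative constant c < 0."""
    return c * true_message

def gaussian_attack(dim, std, rng):
    """Gaussian attack: send a vector with i.i.d. zero-mean Gaussian entries."""
    return rng.normal(scale=std, size=dim)

def sample_duplicating(regular_messages, target_worker):
    """Sample-duplicating attack: every Byzantine worker copies the message of one
    chosen regular worker (used only in the non-i.i.d. experiments)."""
    return regular_messages[target_worker].copy()
```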
Robustness in i.i.d. case
We carry out numerical experiments on softmax regression on the MNIST dataset. The step size of our proposed method is γ = 0.…. When there exist Byzantine attacks, we uniformly randomly select B = 6 Byzantine workers.
Without Byzantine attacks.
When there exist no Byzantineattacks, all the methods are able to achieve satisfactory clas-sification accuracies, as illustrated in Figure 3.
Sign-flipping attacks.
As depicted in Figure 4, the distributed SGD is vulnerable and fails under the sign-flipping attacks. The other methods all perform well.
Gaussian attacks.
The results of Gaussian attacks are shown in Figure 5. The distributed SGD also fails, while the other methods have robustness to Gaussian attacks.
Figure 3: Without Byzantine attacks on i.i.d. MNIST data in softmax regression.

Figure 4: With sign-flipping attacks on i.i.d. MNIST data in softmax regression.

Figure 5: With Gaussian attacks on i.i.d. MNIST data in softmax regression.

Figure 6: With sample-duplicating attacks on non-i.i.d. MNIST data in softmax regression.

Figure 7: With sample-duplicating attacks on non-i.i.d. MNIST data in neural network training.

Figure 8: With sample-duplicating attacks on non-i.i.d. COVTYPE data in neural network training.
Robustness in non-i.i.d. case
Sample-duplicating attacks on MNIST in softmax regression.
We choose B = 6 workers that originally share the samples of two classes as the Byzantine workers. Therefore, these two classes essentially disappear under the sample-duplicating attacks, such that the best possible accuracy is 0.8. For the regular worker that is chosen to be duplicated, its class now has 9 workers out of the total W = 30. The step size of our proposed method is γ = 0.…. As shown in Figure 6, our proposed method is able to achieve a classification accuracy of around 0.73. RS-Byrd-SGD also demonstrates robustness since the resampling strategy helps reduce the impact of both inner and outer variations. However, as we have shown in the theoretical analysis, solely using resampling is unable to fully eliminate the inner variation. In contrast, the proposed method introduces SAGA to address this issue, yielding enhanced classification accuracy. RSA works well since it is developed to handle the non-i.i.d. case. The other Byzantine-robust methods, Byrd-SGD, Krum and Byrd-SAGA, all show degraded performance since they are essentially designed for distributed i.i.d. data.

Sample-duplicating attacks on MNIST in neural network training.
We choose B = 3 workers that originally share the samples of one class as the Byzantine workers, such that the best possible accuracy is 0.9 under the sample-duplicating attacks. The step size of our proposed method is γ = 0.…. We compare Byrd-SGD, RS-Byrd-SGD, Byrd-SAGA and our proposed method in Figure 7. Byrd-SGD fails, but Byrd-SAGA is able to attain a favorable classification accuracy since the ratio of Byzantine workers is only 1/10. Among the two methods developed for the non-i.i.d. case, our proposed method outperforms RS-Byrd-SGD. The performance gain of Byrd-SAGA over Byrd-SGD and that of the proposed method over RS-Byrd-SGD are both due to the elimination of the inner variation.

Sample-duplicating attacks on COVTYPE in neural network training.
We choose B = 3 workers that originally share the samples of one class as the Byzantine workers, such that the best possible accuracy is 0.86 under the sample-duplicating attacks. The step size of our proposed method is γ = 0.…. As shown in Figure 8, Byrd-SGD is still the worst. However, Byrd-SAGA and RS-Byrd-SGD are both remarkably outperformed by our proposed method, since the ratio of Byzantine workers is now raised to 1/7. Therefore, eliminating the inner variation and reducing the outer variation are of particular importance when a larger portion of the received messages are Byzantine.

Conclusions
We develop a Byzantine-robust variance-reduced method to deal with the finite-sum minimization problem over distributed non-i.i.d. data. To reduce the impact of the stochastic gradient noise that hinders the resistance to Byzantine attacks, we adopt the resampling strategy and SAGA to reduce the outer variation and fully eliminate the inner variation. The variance-reduced messages are then aggregated by geometric median. Theoretical results show that the proposed method reaches a neighborhood of the optimal solution at a linear convergence rate, and the learning error is determined by the number of Byzantine workers. Numerical experiments demonstrate the robustness of our proposed method. In future work, we will investigate the elimination of the outer variation, and the combination of variance-reduced methods with other robust aggregation rules.

References
Azulay, S.; Raz, L.; Globerson, A.; Koren, T.; and Afek, Y. 2020. Holdout SGD: Byzantine Tolerant Federated Learning. arXiv preprint arXiv:2008.04612.
Blanchard, P.; Guerraoui, R.; Stainer, J.; et al. 2017. Machine Learning with Adversaries: Byzantine Tolerant Gradient Descent. In Advances in Neural Information Processing Systems, 119–129.
Bottou, L. 2010. Large-Scale Machine Learning with Stochastic Gradient Descent. In International Conference on Computational Statistics, 177–186.
Chen, Y.; Su, L.; and Xu, J. 2017. Distributed Statistical Machine Learning in Adversarial Settings: Byzantine Gradient Descent. ACM on Measurement and Analysis of Computing Systems.
Dong et al. 2020. arXiv preprint arXiv:2006.09992.
El Mhamdi, E. M.; Guerraoui, R.; and Rouault, S. 2020. Distributed Momentum for Byzantine-Resilient Learning. arXiv preprint arXiv:2003.00010.
El Mhamdi, E. M.; Guerraoui, R.; and Rouault, S. L. A. 2018. The Hidden Vulnerability of Distributed Learning in Byzantium. In International Conference on Machine Learning, 3521–3530.
Ghosh, A.; Hong, J.; Yin, D.; and Ramchandran, K. 2019. Robust Federated Learning in a Heterogeneous Environment. arXiv preprint arXiv:1906.06629.
He, L.; Karimireddy, S. P.; and Jaggi, M. 2020. Byzantine-Robust Learning on Heterogeneous Datasets via Resampling. arXiv preprint arXiv:2006.09365.
Kairouz, P.; McMahan, H. B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A. N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. 2019. Advances and Open Problems in Federated Learning. arXiv preprint arXiv:1912.04977.
Khanduri, P.; Bulusu, S.; Sharma, P.; and Varshney, P. K. 2019. Byzantine Resilient Non-Convex SVRG with Distributed Batch Gradient Computations. arXiv preprint arXiv:1912.04531.
Konečný, J.; McMahan, H. B.; Ramage, D.; and Richtárik, P. 2016a. Federated Optimization: Distributed Machine Learning for On-Device Intelligence. arXiv preprint arXiv:1610.02527.
Konečný, J.; McMahan, H. B.; Yu, F. X.; Richtárik, P.; Suresh, A. T.; and Bacon, D. 2016b. Federated Learning: Strategies for Improving Communication Efficiency. arXiv preprint arXiv:1610.05492.
Lamport, L.; Shostak, R. E.; and Pease, M. C. 1982. The Byzantine Generals Problem. ACM Transactions on Programming Languages and Systems.
Li et al. 2019. In AAAI Conference on Artificial Intelligence, 1544–1551.
Li, S.; Cheng, Y.; Wang, W.; Liu, Y.; and Chen, T. 2020a. Learning to Detect Malicious Clients for Robust Federated Learning. arXiv preprint arXiv:2002.00211.
Li, T.; Sahu, A. K.; Talwalkar, A.; and Smith, V. 2020b. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine.
McMahan et al. 2016. arXiv preprint arXiv:1602.05629.
Rodríguez-Barroso, N.; Martínez-Cámara, E.; Luzón, M.; Seco, G. G.; Veganzones, M. Á.; and Herrera, F. 2020. Dynamic Federated Learning Model for Identifying Adversarial Clients. arXiv preprint arXiv:2007.15030.
Wu, Z.; Ling, Q.; Chen, T.; and Giannakis, G. B. 2020. Federated Variance-Reduced Stochastic Gradient Descent with Robustness to Byzantine Attacks. IEEE Transactions on Signal Processing, 68: 4583–4596.
Xie, C.; Koyejo, O.; and Gupta, I. 2020. Fall of Empires: Breaking Byzantine-Tolerant SGD by Inner Product Manipulation. In Uncertainty in Artificial Intelligence, 261–270.
Yang, Q.; Liu, Y.; Chen, T.; and Tong, Y. 2019. Federated Machine Learning: Concept and Applications. ACM Transactions on Intelligent Systems and Technology.
Yang, Gang, and Bajwa. 2020. IEEE Signal Processing Magazine, 37: 146–159.
Yin, D.; Chen, Y.; Kannan, R.; and Bartlett, P. 2018. Byzantine-Robust Distributed Learning: Towards Optimal Statistical Rates. In International Conference on Machine Learning, 5650–5659.
Zhao, Y.; Li, M.; Lai, L.; Suda, N.; Civin, D.; and Chandra, V. 2018. Federated Learning with Non-IID Data. arXiv preprint arXiv:1806.00582.

Supplementary Material for Byzantine-Robust Variance-Reduced Federated Learning over Distributed Non-i.i.d. Data

Proof of Lemma 1

To prove Lemma 1, we review the following supporting lemma.
Lemma 3. (Wu et al. 2020, Lemmas 2 and 3) Let {z_w, w ∈ \mathcal{W}} be a subset of random vectors distributed in a normed vector space. It holds when B < W/2 that

E\|\text{geomed}(\{z_w, w \in \mathcal{W}\})\|^2 \le C_\alpha \frac{1}{R} \sum_{w \in \mathcal{R}} E\|z_w\|^2,   (23)

where \alpha := B/W and C_\alpha := \frac{2 - 2\alpha}{1 - 2\alpha}. Define z^*_\epsilon as an \epsilon-approximate geometric median of \{z_w, w \in \mathcal{W}\}. It holds when B < W/2 that

E\|z^*_\epsilon\|^2 \le C_\alpha \frac{1}{R} \sum_{w \in \mathcal{R}} E\|z_w\|^2 + \left(\frac{2\epsilon}{W - 2B}\right)^2.   (24)

Lemma 4. (Full version of Lemma 1) Let {z_w, w ∈ \mathcal{W}} be a subset of random vectors distributed in a normed vector space. It holds when B < W/2 that

E\|\text{geomed}(\{z_w, w \in \mathcal{W}\}) - \bar z\|^2 \le C_\alpha \frac{1}{R} \sum_{w \in \mathcal{R}} E\|z_w - E z_w\|^2 + C_\alpha \frac{1}{R} \sum_{w \in \mathcal{R}} \|E z_w - \bar z\|^2,   (25)

where \bar z := \frac{1}{R} \sum_{w \in \mathcal{R}} E z_w, \alpha := B/W, and C_\alpha := \frac{2 - 2\alpha}{1 - 2\alpha}. Define z^*_\epsilon as an \epsilon-approximate geometric median of \{z_w, w \in \mathcal{W}\}. It holds when B < W/2 that

E\|z^*_\epsilon - \bar z\|^2 \le C_\alpha \frac{1}{R} \sum_{w \in \mathcal{R}} E\|z_w - E z_w\|^2 + 2 C_\alpha \frac{1}{R} \sum_{w \in \mathcal{R}} \|E z_w - \bar z\|^2 + \left(\frac{2\epsilon}{W - 2B}\right)^2.   (26)

Proof.
For simplicity, we only prove (25). With Lemma 3, it holds that

E\|\text{geomed}(\{z_w, w \in \mathcal{W}\}) - \bar z\|^2 = E\|\text{geomed}(\{z_w - \bar z, w \in \mathcal{W}\})\|^2 \le C_\alpha \frac{1}{R} \sum_{w \in \mathcal{R}} E\|z_w - \bar z\|^2 = C_\alpha \frac{1}{R} \sum_{w \in \mathcal{R}} E\|z_w - E z_w\|^2 + C_\alpha \frac{1}{R} \sum_{w \in \mathcal{R}} \|E z_w - \bar z\|^2,   (27)

where the last equality uses the decomposition

\frac{1}{R} \sum_{w \in \mathcal{R}} E\|z_w - \bar z\|^2 = \frac{1}{R} \sum_{w \in \mathcal{R}} E\|z_w - E z_w\|^2 + \frac{1}{R} \sum_{w \in \mathcal{R}} \|E z_w - \bar z\|^2.   (28)

This completes the proof.

Note that the first part of Lemma 4 has a similar form as Lemma 1 in (Wu et al. 2020), but the coefficients are reduced by a factor of 2 due to the proper use of the decomposition in (28).

Proof of Lemma 2
We begin with reviewing the following supporting lemma.
Lemma 5. (He, Karimireddy, and Jaggi 2020, Proposition 1) Let {z_w, w ∈ \mathcal{W}} be a subset of vectors distributed in a normed vector space, and generate from it a new set {\tilde z_w, w ∈ \mathcal{W}} using the resampling strategy with s-replacement. When B < W/s, there exists a set \mathcal{R}' ⊆ \mathcal{W} with at least W − sB elements, such that for any w' ∈ \mathcal{R}' it holds that

E \tilde z_{w'} = \frac{1}{R} \sum_{w \in \mathcal{R}} z_w,   (29)

E\|\tilde z_{w'} - E \tilde z_{w'}\|^2 = d \, \frac{1}{R} \sum_{w \in \mathcal{R}} \left\| z_w - \frac{1}{R} \sum_{u \in \mathcal{R}} z_u \right\|^2,   (30)

where d := \frac{W - s}{s(W - 1)} and the expectation E is taken over the resampling process.

Lemma 5 shows that after resampling, the expectations of at least W − sB elements are close to the average of {z_w, w ∈ \mathcal{R}}. The variance E\|\tilde z_w - E\tilde z_w\|^2 reduces by a factor of d relative to the variation \frac{1}{R}\sum_{w\in\mathcal{R}} \|z_w - \frac{1}{R}\sum_{u\in\mathcal{R}} z_u\|^2. Note that d is close to 1/s if W is sufficiently large. We further investigate the properties of these W − sB elements in the following lemma.

Lemma 6.
Let {z_w, w ∈ \mathcal{W}} be a subset of random vectors distributed in a normed vector space, where the random vectors in {z_w, w ∈ \mathcal{R}} are independent. Generate from {z_w, w ∈ \mathcal{W}} a new set {\tilde z_w, w ∈ \mathcal{W}} using the resampling strategy with s-replacement. When B < W/s, there exists a set \mathcal{R}' ⊆ \mathcal{W} with at least W − sB elements, such that for any w' ∈ \mathcal{R}', (29) and (30) hold true, and

\frac{1}{|\mathcal{R}'|} \sum_{w' \in \mathcal{R}'} E\|\tilde z_{w'} - \bar z\|^2 \le \left(d + \frac{1-d}{R}\right) \frac{1}{R} \sum_{w \in \mathcal{R}} E\|z_w - E z_w\|^2 + d \, \frac{1}{R} \sum_{w \in \mathcal{R}} \|E z_w - \bar z\|^2,   (31)

where d := \frac{W - s}{s(W - 1)} and \bar z := \frac{1}{R} \sum_{w \in \mathcal{R}} E z_w.

Proof. We decompose the term \frac{1}{|\mathcal{R}'|}\sum_{w'\in\mathcal{R}'} E\|\tilde z_{w'} - \bar z\|^2 into two terms:

\frac{1}{|\mathcal{R}'|} \sum_{w' \in \mathcal{R}'} E\|\tilde z_{w'} - \bar z\|^2 = \frac{1}{|\mathcal{R}'|} \sum_{w' \in \mathcal{R}'} E\left\|\tilde z_{w'} - \frac{1}{R}\sum_{w\in\mathcal{R}} z_w\right\|^2 + E\left\|\frac{1}{R}\sum_{w\in\mathcal{R}} z_w - \bar z\right\|^2 \le d \, \frac{1}{R}\sum_{w\in\mathcal{R}} E\left\|z_w - \frac{1}{R}\sum_{u\in\mathcal{R}} z_u\right\|^2 + E\left\|\frac{1}{R}\sum_{w\in\mathcal{R}} z_w - \bar z\right\|^2,   (32)

where the inequality comes from Lemma 5, which gives for any w' ∈ \mathcal{R}'

E\left\|\tilde z_{w'} - \frac{1}{R}\sum_{w\in\mathcal{R}} z_w\right\|^2 = E\|\tilde z_{w'} - E\tilde z_{w'}\|^2 \le d \, \frac{1}{R}\sum_{w\in\mathcal{R}} E\left\|z_w - \frac{1}{R}\sum_{u\in\mathcal{R}} z_u\right\|^2.   (33)

The first term at the right-hand side of (32) can be further decomposed into three terms:

\frac{1}{R}\sum_{w\in\mathcal{R}} E\left\|z_w - \frac{1}{R}\sum_{u\in\mathcal{R}} z_u\right\|^2 = \frac{1}{R}\sum_{w\in\mathcal{R}} E\left\|(z_w - E z_w) + \left(E z_w - \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u\right) + \left(\frac{1}{R}\sum_{u\in\mathcal{R}} E z_u - \frac{1}{R}\sum_{u\in\mathcal{R}} z_u\right)\right\|^2
= \frac{1}{R}\sum_{w\in\mathcal{R}} E\|z_w - E z_w\|^2 + \frac{1}{R}\sum_{w\in\mathcal{R}} \left\|E z_w - \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u\right\|^2 + E\left\|\frac{1}{R}\sum_{w\in\mathcal{R}} z_w - \frac{1}{R}\sum_{w\in\mathcal{R}} E z_w\right\|^2 + T_1 + T_2 + T_3
= \frac{1}{R}\sum_{w\in\mathcal{R}} E\|z_w - E z_w\|^2 + \frac{1}{R}\sum_{w\in\mathcal{R}} \left\|E z_w - \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u\right\|^2 - E\left\|\frac{1}{R}\sum_{w\in\mathcal{R}} z_w - \frac{1}{R}\sum_{w\in\mathcal{R}} E z_w\right\|^2,   (34)

where T_1, T_2 and T_3 denote the three cross terms of the expansion. To see how to reach the last equality, we check the three cross terms.
The first cross term vanishes:

T_1 = \frac{2}{R}\sum_{w\in\mathcal{R}} E\left\langle z_w - E z_w, \; E z_w - \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u\right\rangle = \frac{2}{R}\sum_{w\in\mathcal{R}} \left\langle E z_w - E z_w, \; E z_w - \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u\right\rangle = 0.   (35)

Similarly, the second cross term vanishes:

T_2 = \frac{2}{R}\sum_{w\in\mathcal{R}} E\left\langle E z_w - \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u, \; \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u - \frac{1}{R}\sum_{u\in\mathcal{R}} z_u\right\rangle = \frac{2}{R}\sum_{w\in\mathcal{R}} \left\langle E z_w - \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u, \; \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u - \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u\right\rangle = 0.   (36)

The third cross term equals

T_3 = \frac{2}{R}\sum_{w\in\mathcal{R}} E\left\langle z_w - E z_w, \; \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u - \frac{1}{R}\sum_{u\in\mathcal{R}} z_u\right\rangle = 2 E\left\langle \frac{1}{R}\sum_{w\in\mathcal{R}} z_w - \frac{1}{R}\sum_{w\in\mathcal{R}} E z_w, \; \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u - \frac{1}{R}\sum_{u\in\mathcal{R}} z_u\right\rangle = -2 E\left\|\frac{1}{R}\sum_{w\in\mathcal{R}} z_w - \frac{1}{R}\sum_{w\in\mathcal{R}} E z_w\right\|^2.   (37)

Substituting (34) into (32) and using the definition \bar z := \frac{1}{R}\sum_{w\in\mathcal{R}} E z_w, we have

\frac{1}{|\mathcal{R}'|}\sum_{w'\in\mathcal{R}'} E\|\tilde z_{w'} - \bar z\|^2 \le d \, \frac{1}{R}\sum_{w\in\mathcal{R}} E\|z_w - E z_w\|^2 + d \, \frac{1}{R}\sum_{w\in\mathcal{R}} \left\|E z_w - \frac{1}{R}\sum_{u\in\mathcal{R}} E z_u\right\|^2 + (1 - d)\, E\left\|\frac{1}{R}\sum_{w\in\mathcal{R}} z_w - \frac{1}{R}\sum_{w\in\mathcal{R}} E z_w\right\|^2
= d \, \frac{1}{R}\sum_{w\in\mathcal{R}} E\|z_w - E z_w\|^2 + d \, \frac{1}{R}\sum_{w\in\mathcal{R}} \|E z_w - \bar z\|^2 + \frac{1-d}{R} \cdot \frac{1}{R}\sum_{w\in\mathcal{R}} E\|z_w - E z_w\|^2
= \left(d + \frac{1-d}{R}\right)\frac{1}{R}\sum_{w\in\mathcal{R}} E\|z_w - E z_w\|^2 + d \, \frac{1}{R}\sum_{w\in\mathcal{R}} \|E z_w - \bar z\|^2.   (38)

The first equality in (38) comes from

E\left\|\frac{1}{R}\sum_{w\in\mathcal{R}} z_w - \frac{1}{R}\sum_{w\in\mathcal{R}} E z_w\right\|^2 = E\left\|\frac{1}{R}\sum_{w\in\mathcal{R}} (z_w - E z_w)\right\|^2 = \frac{1}{R^2}\sum_{w\in\mathcal{R}} E\|z_w - E z_w\|^2 + \sum_{\substack{w,u\in\mathcal{R} \\ w \neq u}} E\left\langle \frac{1}{R}(z_w - E z_w), \frac{1}{R}(z_u - E z_u)\right\rangle = \frac{1}{R^2}\sum_{w\in\mathcal{R}} E\|z_w - E z_w\|^2,   (39)

where the last equality is due to the fact that the random vectors in {z_w, w ∈ \mathcal{R}} are independent. This completes the proof.

Now we compare the two bounds (28) and (31). The left-hand sides of (28) and (31) are the mean-square errors of {z_w, w ∈ \mathcal{R}} and {\tilde z_{w'}, w' ∈ \mathcal{R}'} relative to \bar z, respectively. Since d + \frac{1-d}{R} \le 1 and d \le 1, we see that the bias of {\tilde z_{w'}, w' ∈ \mathcal{R}'} to \bar z is smaller than the bias of {z_w, w ∈ \mathcal{R}} to \bar z, showing the "variance reduction" property of resampling.

Lemma 7. (Full version of Lemma 2) Let {z_w, w ∈ \mathcal{W}} be a subset of random vectors distributed in a normed vector space, where the random vectors in {z_w, w ∈ \mathcal{R}} are independent. Generate from {z_w, w ∈ \mathcal{W}} a new set {\tilde z_w, w ∈ \mathcal{W}} using the resampling strategy with s-replacement. It holds when B < W/(2s) that

E\|\text{geomed}(\{\tilde z_w, w \in \mathcal{W}\}) - \bar z\|^2 \le \left(d + \frac{1-d}{R}\right) C_{s\alpha} \frac{1}{R}\sum_{w\in\mathcal{R}} E\|z_w - E z_w\|^2 + d C_{s\alpha} \frac{1}{R}\sum_{w\in\mathcal{R}} \|E z_w - \bar z\|^2,   (40)

where \bar z := \frac{1}{R}\sum_{w\in\mathcal{R}} E z_w, \alpha := B/W, C_{s\alpha} := \frac{2-2s\alpha}{1-2s\alpha}, and d := \frac{W-s}{s(W-1)}. Define \tilde z^*_\epsilon as an \epsilon-approximate geometric median of \{\tilde z_w, w \in \mathcal{W}\}.
It holds when B < W/(2s) that

E\|\tilde z^*_\epsilon - \bar z\|^2 \le \left(d + \frac{1-d}{R}\right) C_{s\alpha} \frac{1}{R}\sum_{w\in\mathcal{R}} E\|z_w - E z_w\|^2 + 2 d C_{s\alpha} \frac{1}{R}\sum_{w\in\mathcal{R}} \|E z_w - \bar z\|^2 + \left(\frac{2\epsilon}{W - 2sB}\right)^2.   (41)

Proof.
For simplicity, we only prove (40). As Lemma 5 claims, when B < W/s, there exists a set \mathcal{R}' ⊆ \mathcal{W} with at least W − sB elements, such that for any w' ∈ \mathcal{R}', (29) and (30) hold true. When B < W/(2s), we have |\mathcal{R}'|/W > 1/2. With Lemma 4, it holds that

E\|\text{geomed}(\{\tilde z_w, w \in \mathcal{W}\}) - \bar z\|^2 = E\|\text{geomed}(\{\tilde z_w - \bar z, w \in \mathcal{W}\})\|^2 \le C_{s\alpha} \frac{1}{|\mathcal{R}'|}\sum_{w'\in\mathcal{R}'} E\|\tilde z_{w'} - \bar z\|^2.   (42)

Applying Lemma 6 completes the proof immediately.

Proof of Theorem 1
To prove Theorem 1, we review the following supporting lemma for SAGA.
Lemma 8. (Wu et al. 2020, Lemmas 4 and 5) Under Assumption 1, if all regular workers w ∈ \mathcal{R} update \phi_{w,i_w^k}^k and v_w^k according to (8) and (9), then the corrected stochastic gradient v_w^k satisfies

E\|v_w^k - f'_w(x^k)\|^2 \le \frac{L^2}{J}\sum_{j=1}^{J}\|x^k - \phi_{w,j}^k\|^2, \quad \forall w \in \mathcal{R},   (43)

and

\frac{1}{R}\sum_{w\in\mathcal{R}} E\|v_w^k - f'_w(x^k)\|^2 \le L^2 S^k,   (44)

where S^k is defined as

S^k := \frac{1}{R}\sum_{w\in\mathcal{R}} \frac{1}{J}\sum_{j=1}^{J}\|x^k - \phi_{w,j}^k\|^2.   (45)

Further, S^k satisfies

E S^{k+1} \le \frac{4}{J} E\|x^{k+1} - x^k + \gamma f'(x^k)\|^2 + \frac{4\gamma^2 L^2}{J}\|x^k - x^*\|^2 + \left(1 - \frac{1}{J}\right) S^k.   (46)

Theorem 2. (Full version of Theorem 1) Under Assumptions 1 and 2, if the number of Byzantine workers satisfies B < \frac{W}{2s} and the step size satisfies \gamma \le \frac{\mu\sqrt{J}}{L^2 C_{s\alpha}}, then the iterate x^k generated by the proposed method in Algorithm 2 satisfies

E\|x^k - x^*\|^2 \le \left(1 - \frac{\gamma\mu}{2}\right)^k \Delta_1 + \Delta_2,   (47)

where

\Delta_1 := \|x^0 - x^*\|^2 - \Delta_2,   (48)

\Delta_2 := \frac{5 d C_{s\alpha}\delta^2}{\mu^2},   (49)

while \alpha := B/W, C_{s\alpha} := \frac{2-2s\alpha}{1-2s\alpha}, and d := \frac{W-s}{s(W-1)}. On the other hand, when the step size satisfies \gamma \le \frac{\mu\sqrt{J}}{L^2 C_{s\alpha}}, the iterate x^k generated by the proposed method in Algorithm 2 with \epsilon-approximate geometric median aggregation satisfies

E\|x^k - x^*\|^2 \le \left(1 - \frac{\gamma\mu}{2}\right)^k \Delta_1 + 2\Delta_2 + \frac{10\epsilon^2}{\mu^2(W - 2sB)^2}.   (50)

Proof. For simplicity, we only prove (47). According to the proof of Theorem 1 in (Wu et al. 2020), when

\gamma \le \frac{\mu}{L^2},   (51)

it holds that

E\|x^{k+1} - x^*\|^2 \le (1 - \gamma\mu)\|x^k - x^*\|^2 + \frac{2}{\gamma\mu} E\|x^{k+1} - x^k + \gamma f'(x^k)\|^2.   (52)

Then, we construct a Lyapunov function T_1^k as

T_1^k := \|x^k - x^*\|^2 + c_1 S^k,   (53)

where c_1 is a positive constant. Since S^k is non-negative according to the definition in (45), we know that T_1^k is also non-negative. Substituting (46) and (52) into (53) yields

E T_1^{k+1} \le \left(1 - \gamma\mu + \frac{4 c_1 \gamma^2 L^2}{J}\right)\|x^k - x^*\|^2 + \left(\frac{2}{\gamma\mu} + \frac{4 c_1}{J}\right) E\|x^{k+1} - x^k + \gamma f'(x^k)\|^2 + \left(1 - \frac{1}{J}\right) c_1 S^k.   (54)

Note that the second term at the right-hand side of (54) can be bounded with the help of Lemma 7, as

E\|x^{k+1} - x^k + \gamma f'(x^k)\|^2 = \gamma^2 E\|\text{geomed}(\{\tilde v_w^k, w \in \mathcal{W}\}) - f'(x^k)\|^2 \le \gamma^2 \left(d + \frac{1-d}{R}\right) C_{s\alpha} \frac{1}{R}\sum_{w\in\mathcal{R}} E\|v_w^k - f'_w(x^k)\|^2 + \gamma^2 d C_{s\alpha} \frac{1}{R}\sum_{w\in\mathcal{R}} \|f'_w(x^k) - f'(x^k)\|^2 \le \gamma^2 \left(d + \frac{1-d}{R}\right) C_{s\alpha} L^2 S^k + \gamma^2 d C_{s\alpha}\delta^2,   (55)

where the last inequality comes from Lemma 8 and Assumption 2. Therefore, we have

E T_1^{k+1} \le \left(1 - \gamma\mu + \frac{4 c_1 \gamma^2 L^2}{J}\right)\|x^k - x^*\|^2 + \left(\left(1 - \frac{1}{J}\right) c_1 + \left(\frac{2}{\gamma\mu} + \frac{4 c_1}{J}\right)\left(d + \frac{1-d}{R}\right) C_{s\alpha}\gamma^2 L^2\right) S^k + \gamma^2\left(\frac{2}{\gamma\mu} + \frac{4 c_1}{J}\right) d C_{s\alpha}\delta^2.   (56)

If we constrain the step size \gamma so that

\frac{4 c_1 \gamma^2 L^2}{J} \le \frac{\gamma\mu}{2},   (57)

then the coefficients in (56) satisfy

1 - \gamma\mu + \frac{4 c_1 \gamma^2 L^2}{J} \le 1 - \frac{\gamma\mu}{2},   (58)

\frac{2}{\gamma\mu} + \frac{4 c_1}{J} \le \frac{2}{\gamma\mu} + \frac{\mu}{2\gamma L^2} \le \frac{5}{2\gamma\mu}.   (59)

Therefore, we can bound E T_1^{k+1} by

E T_1^{k+1} \le \left(1 - \frac{\gamma\mu}{2}\right)\|x^k - x^*\|^2 + \left(\left(1 - \frac{1}{J}\right) c_1 + \frac{5\gamma}{2\mu}\left(d + \frac{1-d}{R}\right) C_{s\alpha} L^2\right) S^k + \frac{5\gamma}{2\mu} d C_{s\alpha}\delta^2.   (60)

Similarly, if \gamma and c_1 are chosen such that

\frac{\gamma\mu}{2} < \frac{1}{2J},   (61)

and

c_1 = \frac{5 J \gamma L^2 (d + (1-d)/R) C_{s\alpha}}{\mu} \ge \frac{5\gamma L^2 (d + (1-d)/R) C_{s\alpha}/(2\mu)}{1/J - \gamma\mu/2},   (62)

then the coefficient in (60) satisfies

\left(1 - \frac{1}{J}\right) c_1 + \frac{5\gamma}{2\mu}\left(d + \frac{1-d}{R}\right) C_{s\alpha} L^2 \le \left(1 - \frac{\gamma\mu}{2}\right) c_1.   (63)

Hence, (60) becomes

E T_1^{k+1} \le \left(1 - \frac{\gamma\mu}{2}\right)\|x^k - x^*\|^2 + \left(1 - \frac{\gamma\mu}{2}\right) c_1 S^k + \frac{5\gamma}{2\mu} d C_{s\alpha}\delta^2 = \left(1 - \frac{\gamma\mu}{2}\right) T_1^k + \frac{5\gamma}{2\mu} d C_{s\alpha}\delta^2.   (64)

Using telescopic cancellation on (64) from time 1 to time k, we have

E T_1^k \le \left(1 - \frac{\gamma\mu}{2}\right)^{k-1}\left[T_1^1 - \Delta_2\right] + \Delta_2.   (65)

Here and thereafter, the expectation is taken over i_w^t for all regular workers w ∈ \mathcal{R} and times t \le k - 1. The definition of the Lyapunov function in (53) implies that

E\|x^k - x^*\|^2 \le E T_1^k \le \left(1 - \frac{\gamma\mu}{2}\right)^k \Delta_1 + \Delta_2.   (66)

In our derivation so far, the constraints on the step size \gamma (cf. (51), (57) and (61)) are

\gamma \le \min\left\{\frac{\mu}{L^2}, \; \frac{\mu\sqrt{J}}{L^2\left[C_{s\alpha}(d + (1-d)/R)\right]^{1/2}}, \; \frac{1}{J\mu}\right\}.   (67)

Therefore, we simply choose

\gamma \le \frac{\mu\sqrt{J}}{L^2 C_{s\alpha}},   (68)

which completes the proof.

Convergence Property of Byrd-SAGA
The following Theorem establishes the convergence property of Byrd-SAGA. It is close to Theorem 1 in (Wu et al. 2020), butthe bound is tighter due to the use of the improved inequality given by Lemma 4.
Theorem 3.
Under Assumptions 1 and 2, if the number of Byzantine workers satisfies
B < \frac{W}{2} and the step size satisfies \gamma \le \frac{\mu\sqrt{J}}{L^2 C_\alpha}, then the iterate x^k generated by Byrd-SAGA satisfies

E\|x^k - x^*\|^2 \le \left(1 - \frac{\gamma\mu}{2}\right)^k \Delta_1 + \Delta_2,   (69)

where

\Delta_1 := \|x^0 - x^*\|^2 - \Delta_2,   (70)

\Delta_2 := \frac{5 C_\alpha\delta^2}{\mu^2},   (71)

while \alpha := B/W and C_\alpha := \frac{2-2\alpha}{1-2\alpha}. On the other hand, when the step size satisfies \gamma \le \frac{\mu\sqrt{J}}{L^2 C_\alpha}, the iterate x^k generated by Byrd-SAGA with \epsilon-approximate geometric median aggregation satisfies

E\|x^k - x^*\|^2 \le \left(1 - \frac{\gamma\mu}{2}\right)^k \Delta_1 + 2\Delta_2 + \frac{10\epsilon^2}{\mu^2(W - 2B)^2}.   (72)

Proof. For simplicity, we only prove (69). Construct a Lyapunov function T_2^k as

T_2^k := \|x^k - x^*\|^2 + c_2 S^k.   (73)

According to the proof of Theorem 1 in (Wu et al. 2020), when

\gamma \le \frac{\mu}{L^2},   (74)

it holds that

E T_2^{k+1} \le \left(1 - \gamma\mu + \frac{4 c_2\gamma^2 L^2}{J}\right)\|x^k - x^*\|^2 + \left(\frac{2}{\gamma\mu} + \frac{4 c_2}{J}\right) E\|x^{k+1} - x^k + \gamma f'(x^k)\|^2 + \left(1 - \frac{1}{J}\right) c_2 S^k.   (75)

Note that the second term at the right-hand side of (75) can be bounded with the help of Lemma 4, as

E\|x^{k+1} - x^k + \gamma f'(x^k)\|^2 = \gamma^2 E\|\text{geomed}(\{v_w^k, w \in \mathcal{W}\}) - f'(x^k)\|^2 \le \gamma^2 C_\alpha \frac{1}{R}\sum_{w\in\mathcal{R}} E\|v_w^k - f'_w(x^k)\|^2 + \gamma^2 C_\alpha \frac{1}{R}\sum_{w\in\mathcal{R}}\|f'_w(x^k) - f'(x^k)\|^2 \le \gamma^2 C_\alpha L^2 S^k + \gamma^2 C_\alpha\delta^2,   (76)

where the last inequality comes from Lemma 8 and Assumption 2. Therefore, we have

E T_2^{k+1} \le \left(1 - \gamma\mu + \frac{4 c_2\gamma^2 L^2}{J}\right)\|x^k - x^*\|^2 + \left(\left(1 - \frac{1}{J}\right) c_2 + \left(\frac{2}{\gamma\mu} + \frac{4 c_2}{J}\right) C_\alpha\gamma^2 L^2\right) S^k + \gamma^2\left(\frac{2}{\gamma\mu} + \frac{4 c_2}{J}\right) C_\alpha\delta^2.   (77)

If we constrain the step size \gamma so that

\frac{4 c_2\gamma^2 L^2}{J} \le \frac{\gamma\mu}{2},   (78)

then the coefficients in (77) satisfy

1 - \gamma\mu + \frac{4 c_2\gamma^2 L^2}{J} \le 1 - \frac{\gamma\mu}{2},   (79)

\frac{2}{\gamma\mu} + \frac{4 c_2}{J} \le \frac{2}{\gamma\mu} + \frac{\mu}{2\gamma L^2} \le \frac{5}{2\gamma\mu}.   (80)

Therefore, we can bound E T_2^{k+1} by

E T_2^{k+1} \le \left(1 - \frac{\gamma\mu}{2}\right)\|x^k - x^*\|^2 + \left(\left(1 - \frac{1}{J}\right) c_2 + \frac{5\gamma}{2\mu} C_\alpha L^2\right) S^k + \frac{5\gamma}{2\mu} C_\alpha\delta^2.   (81)

Similarly, if \gamma and c_2 are chosen such that

\frac{\gamma\mu}{2} < \frac{1}{2J},   (82)

and

c_2 = \frac{5 J\gamma L^2 C_\alpha}{\mu} \ge \frac{5\gamma L^2 C_\alpha/(2\mu)}{1/J - \gamma\mu/2},   (83)

then the coefficient in (81) satisfies

\left(1 - \frac{1}{J}\right) c_2 + \frac{5\gamma}{2\mu} C_\alpha L^2 \le \left(1 - \frac{\gamma\mu}{2}\right) c_2.   (84)

Hence, (81) becomes

E T_2^{k+1} \le \left(1 - \frac{\gamma\mu}{2}\right)\|x^k - x^*\|^2 + \left(1 - \frac{\gamma\mu}{2}\right) c_2 S^k + \frac{5\gamma}{2\mu} C_\alpha\delta^2 = \left(1 - \frac{\gamma\mu}{2}\right) T_2^k + \frac{5\gamma}{2\mu} C_\alpha\delta^2.   (85)

Using telescopic cancellation on (85) from time 1 to time k, we have

E T_2^k \le \left(1 - \frac{\gamma\mu}{2}\right)^{k-1}\left[T_2^1 - \Delta_2\right] + \Delta_2.   (86)

Here and thereafter, the expectation is taken over i_w^t for all regular workers w ∈ \mathcal{R} and times t \le k - 1. The definition of the Lyapunov function in (73) implies that

E\|x^k - x^*\|^2 \le E T_2^k \le \left(1 - \frac{\gamma\mu}{2}\right)^k \Delta_1 + \Delta_2.   (87)

In our derivation so far, the constraints on the step size \gamma (cf. (74), (78) and (82)) are

\gamma \le \min\left\{\frac{\mu}{L^2}, \; \frac{\mu\sqrt{J}}{L^2 C_\alpha^{1/2}}, \; \frac{1}{J\mu}\right\}.   (88)

Therefore, we simply choose

\gamma \le \frac{\mu\sqrt{J}}{L^2 C_\alpha},   (89)

which completes the proof.

Convergence Property of Byrd-SGD
The following Theorem establishes the convergence property of Byrd-SGD. Again, it is close to Theorem 2 in (Wu et al. 2020),but the bound is tighter due to the use of the improved inequality given by Lemma 4.
Theorem 4.
Under Assumptions 1, 2 and 3, if the number of Byzantine workers satisfies
B < \frac{W}{2} and the step size satisfies \gamma \le \frac{\mu}{L^2}, then the iterate x^k generated by Byrd-SGD satisfies

E\|x^k - x^*\|^2 \le (1 - \gamma\mu)^k \Delta_1 + \Delta_2,   (90)

where

\Delta_1 := \|x^0 - x^*\|^2 - \Delta_2,   (91)

\Delta_2 := \frac{2}{\mu^2}\left(C_\alpha\sigma^2 + C_\alpha\delta^2\right),   (92)

while \alpha := B/W and C_\alpha := \frac{2-2\alpha}{1-2\alpha}. On the other hand, the iterate x^k generated by Byrd-SGD with \epsilon-approximate geometric median aggregation satisfies

E\|x^k - x^*\|^2 \le (1 - \gamma\mu)^k \Delta_1 + 2\Delta_2 + \frac{4\epsilon^2}{\mu^2(W - 2B)^2}.   (93)

Proof. For simplicity, we only prove (90). Note that inequality (52) is still true for Byrd-SGD with \gamma \le \frac{\mu}{L^2}. The only difference is that we need to bound the term E\|x^{k+1} - x^k + \gamma f'(x^k)\|^2 with Lemma 4, as

E\|x^{k+1} - x^k + \gamma f'(x^k)\|^2 = \gamma^2 E\|\text{geomed}(\{g_w^k, w \in \mathcal{W}\}) - f'(x^k)\|^2 \le \gamma^2 C_\alpha \frac{1}{R}\sum_{w\in\mathcal{R}} E\|f'_{w,i_w^k}(x^k) - f'_w(x^k)\|^2 + \gamma^2 C_\alpha \frac{1}{R}\sum_{w\in\mathcal{R}}\|f'_w(x^k) - f'(x^k)\|^2 \le \gamma^2 C_\alpha\sigma^2 + \gamma^2 C_\alpha\delta^2.   (94)

Therefore, we have

E\|x^{k+1} - x^*\|^2 \le (1 - \gamma\mu)\|x^k - x^*\|^2 + \frac{2}{\gamma\mu} E\|x^{k+1} - x^k + \gamma f'(x^k)\|^2 \le (1 - \gamma\mu)\|x^k - x^*\|^2 + \frac{2\gamma}{\mu}\left(C_\alpha\sigma^2 + C_\alpha\delta^2\right).   (95)

Here and thereafter, the expectation is taken over i_w^k for all regular workers w ∈ \mathcal{R}. Using telescopic cancellation on (95) completes the proof.

Convergence Property of RS-Byrd-SGD
Theorem 4 shows that the learning error of Byrd-SGD is in the order of O(C_\alpha\sigma^2 + C_\alpha\delta^2). Now, we are going to show that the learning error of RS-Byrd-SGD is in the order of O\left(\left(d + \frac{1-d}{R}\right)C_{s\alpha}\sigma^2 + d C_{s\alpha}\delta^2\right).

Theorem 5.
Under Assumptions 1, 2 and 3, if the number of Byzantine workers satisfies
B < \frac{W}{2s} and the step size satisfies \gamma \le \frac{\mu}{L^2}, then the iterate x^k generated by RS-Byrd-SGD satisfies

E\|x^k - x^*\|^2 \le (1 - \gamma\mu)^k \Delta_1 + \Delta_2,   (96)

where

\Delta_1 := \|x^0 - x^*\|^2 - \Delta_2,   (97)

\Delta_2 := \frac{2}{\mu^2}\left(\left(d + \frac{1-d}{R}\right) C_{s\alpha}\sigma^2 + d C_{s\alpha}\delta^2\right),   (98)

while \alpha := B/W, C_{s\alpha} := \frac{2-2s\alpha}{1-2s\alpha}, and d := \frac{W-s}{s(W-1)}. On the other hand, the iterate x^k generated by RS-Byrd-SGD with \epsilon-approximate geometric median aggregation satisfies

E\|x^k - x^*\|^2 \le (1 - \gamma\mu)^k \Delta_1 + 2\Delta_2 + \frac{4\epsilon^2}{\mu^2(W - 2sB)^2}.   (99)

Proof. For simplicity, we only prove (96). Note that inequality (52) is still true for RS-Byrd-SGD with \gamma \le \frac{\mu}{L^2}. The only difference is that we need to bound the term E\|x^{k+1} - x^k + \gamma f'(x^k)\|^2 with Lemma 7, as

E\|x^{k+1} - x^k + \gamma f'(x^k)\|^2 = \gamma^2 E\|\text{geomed}(\{\tilde g_w^k, w \in \mathcal{W}\}) - f'(x^k)\|^2 \le \gamma^2\left(\left(d + \frac{1-d}{R}\right) C_{s\alpha}\frac{1}{R}\sum_{w\in\mathcal{R}} E\|f'_{w,i_w^k}(x^k) - f'_w(x^k)\|^2 + d C_{s\alpha}\frac{1}{R}\sum_{w\in\mathcal{R}}\|f'_w(x^k) - f'(x^k)\|^2\right) \le \gamma^2\left(\left(d + \frac{1-d}{R}\right) C_{s\alpha}\sigma^2 + d C_{s\alpha}\delta^2\right).   (100)

Therefore, we have

E\|x^{k+1} - x^*\|^2 \le (1 - \gamma\mu)\|x^k - x^*\|^2 + \frac{2}{\gamma\mu} E\|x^{k+1} - x^k + \gamma f'(x^k)\|^2 \le (1 - \gamma\mu)\|x^k - x^*\|^2 + \frac{2\gamma}{\mu}\left(\left(d + \frac{1-d}{R}\right) C_{s\alpha}\sigma^2 + d C_{s\alpha}\delta^2\right).   (101)

Here and thereafter, the expectation is taken over i_w^k for all regular workers w ∈ \mathcal{R}. Using telescopic cancellation on (101) completes the proof.