Fast and Robust Distributed Learning in High Dimension
El-Mahdi El-Mhamdi (EPFL), elmahdi.elmhamdi@epfl.ch
Rachid Guerraoui (EPFL), rachid.guerraoui@epfl.ch
Abstract
Modern machine learning is distributed, and the work of several machines is typically aggregated by averaging, the optimal rule in terms of speed: it offers a speedup of $n$ (with respect to using a single machine) when $n$ processes are learning together. Distributing data and models however poses fundamental vulnerabilities, be they to software bugs, asynchrony, or, worse, to malicious attackers controlling some machines or injecting misleading data into the network. Such behavior is best modeled as Byzantine failures, and averaging does not tolerate a single one from a worker.

Krum, the first provably Byzantine-resilient aggregation rule for SGD, uses only one worker's gradient per step, which hampers its speed of convergence, especially in best-case conditions when none of the workers is actually Byzantine. The idea, coined multi-Krum, of using $m$ different workers per step was mentioned, without however any proof of either its Byzantine resilience or its slowdown. More recently, it was shown that in high-dimensional machine learning, guaranteeing convergence is not a sufficient condition for strong Byzantine resilience. An improvement on Krum, coined Bulyan, was proposed and proved to guarantee stronger resilience. However, Bulyan suffers from the same weakness as Krum: it uses only one worker per step. This adds up to the aforementioned open problem and leaves the crucial need for both fast and strong Byzantine resilience unfulfilled.

The present paper tackles both open problems and proposes using Bulyan over multi-Krum (we call it
MULTI-BULYAN), a combination for which we provide proofs of strong Byzantine resilience, as well as of an $m/n$ slowdown compared to averaging, the fastest (but non-Byzantine-resilient) rule for distributed machine learning.

Finally, modern machine learning involves data of unprecedentedly high dimension: some models are nowadays vectors of dimension $d = 10^9$. In order to deliver results within a reasonable time, learning algorithms should be at most linear in $d$ and avoid using classic security mechanisms, most of which are at least quadratic in $d$ and hence impractical. A strength of MULTI-BULYAN is that it inherits the $O(d)$ merits of both multi-Krum and Bulyan.

*A practical implementation based on this work is described in [3]; the code is available in the following GitHub repository: https://github.com/LPD-EPFL/AggregaThor
†Work in progress.

1 Introduction
The ongoing data deluge has been both a blessing and a burden for machine learning system designers. A blessing, since machine learning provably performs better with more training data [10]; a burden, since the numbers involved are beyond previous orders of magnitude: machine learning parameter sets are now in the gigabytes, and training data is several orders of magnitude beyond that [5]. For the latter reason, distributed machine learning is not an option, it is the only way to deliver results in a reasonable time for the user; this holds in particular for Stochastic Gradient Descent (SGD), the algorithm that is the workhorse of today's machine learning. The constraint is all the more crucial since ML, given its workload, relies on large-scale distributed systems, for which communication costs come on top of local computation costs.

We prove the similar $m/n$ slowdown of MULTI-BULYAN and its (strong) Byzantine resilience. We deduce that
MULTI-BULYAN ensures strong Byzantine resilience while being $m/n$ times as fast as the optimal algorithm (averaging) in the absence of Byzantine workers. MULTI-BULYAN can be viewed as a generalization (one that also uses $m$ different workers per step, to leverage the fact that $f$, possibly less than a minority, can be faulty) of Bulyan, the defense mechanism we present in [6]. Before presenting in Section 3 our proofs of convergence and slowdown of MULTI-KRUM, and in Section 4 our proofs of convergence and slowdown of BULYAN and hence
MULTI-BULYAN, we introduce in Section 2 a toolbox of formal definitions: weak, strong, and $(\alpha, f)$-Byzantine resilience. We also present the necessary context on non-convex optimization, as well as its interplay with the high dimensionality of machine learning, together with the $\sqrt{d}$ leeway it provides to strong attackers.

2 Model and Definitions

The learning task consists in making accurate predictions for the labels of each data instance $\xi_i$ using a high-dimensional model (for example, a neural network); we denote the $d$ parameters of that model by the vector $x$. Each data instance has a set of features (e.g., image pixels) and a set of labels (e.g., {cat, person}). The model is trained with the popular backpropagation algorithm based on SGD. Specifically, SGD addresses the following optimization problem:

$$\min_{x \in \mathbb{R}^d} Q(x) \triangleq \mathbb{E}_{\xi}\, F(x; \xi) \tag{1}$$

where $\xi$ is a random variable representing a total of $B$ data instances and $F(x; \xi)$ is the loss function. The function $Q(x)$ is smooth but not convex.

SGD computes the gradient $G(x, \xi) \triangleq \nabla F(x; \xi)$ and then updates the model parameters $x$ in a direction opposite to that of the gradient (descent). The vanilla SGD update rule, given a sequence of learning rates $\{\gamma_k\}$, is the following at any given step (a step denotes an update of the model parameters):

$$x^{(k+1)} = x^{(k)} - \gamma_k \cdot G(x^{(k)}, \xi_k) \tag{2}$$

The popularity of SGD stems from its ability to employ noisy approximations of the actual gradient. In a distributed setup, SGD employs a mini-batch of $b < B$ training instances for the gradient computation:

$$G(x, \xi) = \frac{1}{b} \sum_{i=1}^{b} G(x, \xi_i) \tag{3}$$

Figure 1: Correct workers (black dashed arrows) estimating the real gradient (blue full arrow) while a Byzantine worker (red dotted) proposes a misleading vector.

The size of the mini-batch $b$ induces a trade-off between the robustness of a given update (noise in the gradient approximation) and the time required to compute this update. The mini-batch also affects the amount of parallelism (Equation 3) that modern computing clusters (multi-GPU etc.) largely benefit from. Scaling the mini-batch size to exploit additional parallelism however requires a non-trivial selection of the sequence of learning rates [7]. A very important assumption for the convergence properties of SGD is that each gradient is an unbiased estimate of the actual gradient, which is typically ensured through uniform random sampling, i.e., gradients that are in expectation equal to the actual gradient (Figure 1).
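To make Equations (1)-(3) concrete, here is a minimal, self-contained sketch of this mini-batch SGD loop in Python/NumPy. The function names, the toy least-squares loss, and all constants are ours (illustrative assumptions, not the paper's setup):

```python
import numpy as np

def minibatch_gradient(grad_fn, x, samples):
    # Equation (3): average of the b per-example gradient estimates G(x, xi_i)
    return np.mean([grad_fn(x, xi) for xi in samples], axis=0)

def sgd(grad_fn, x0, data, learning_rates, batch_size, seed=0):
    # Equation (2): x^(k+1) = x^(k) - gamma_k * G(x^(k), xi_k)
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for gamma in learning_rates:
        batch = rng.choice(len(data), size=batch_size, replace=False)
        x = x - gamma * minibatch_gradient(grad_fn, x, [data[i] for i in batch])
    return x

# Toy usage: least squares, F(x; (a, y)) = 0.5 * (a @ x - y)^2, so G = (a @ x - y) * a
rng = np.random.default_rng(1)
x_true = rng.normal(size=5)
data = [(a, a @ x_true) for a in rng.normal(size=(1000, 5))]
grad = lambda x, xi: (xi[0] @ x - xi[1]) * xi[0]
x_hat = sgd(grad, np.zeros(5), data, [0.05] * 500, batch_size=32)
print(np.linalg.norm(x_hat - x_true))  # close to 0
```

The uniform sampling of the mini-batch is what makes each aggregated gradient an unbiased estimate of $\nabla Q$, the assumption the rest of the analysis builds on.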
MULTI-BULYAN relies on two algorithmic components: MULTI-KRUM [1] and BULYAN [6]. The former rule requires that $n \geq 2f + 3$ and the second requires that $n \geq 4f + 3$.

Intuitively, the goal of MULTI-KRUM is to select the gradients that deviate less from the "majority", based on their relative distances. Given gradients $G_1, \dots, G_n$ proposed by workers $1$ to $n$ respectively, MULTI-KRUM selects the $m$ gradients with the smallest sum of scores (squared L2 distances to the closest other gradients) as follows:

$$(m)\operatorname*{arg\,min}_{i \in \{1, \dots, n\}} \sum_{i \to j} \|G_i - G_j\|^2 \tag{4}$$

where, given a function $X(i)$, $(m)\arg\min(X(i))$ denotes the indexes $i$ with the $m$ smallest $X(i)$ values, and $i \to j$ means that $G_j$ is among the $n - f - 2$ closest gradients to $G_i$. BULYAN in turn takes the aforementioned $m$ vectors, computes their coordinate-wise median, and produces a gradient whose coordinates are the average of the $m - 2f$ values closest to the median.
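Equation (4) can be sketched as follows; this is our reading of the rule (the function name and the brute-force $O(n^2 \cdot d)$ pairwise-distance computation are ours):

```python
import numpy as np

def multi_krum(gradients, f, m):
    # Equation (4): keep the m gradients with the smallest Krum scores, then
    # average them. Score(i) = sum of squared L2 distances from G_i to its
    # n - f - 2 closest other gradients.
    G = np.asarray(gradients, dtype=float)
    n = len(G)
    assert 1 <= m <= n - f - 2
    d2 = np.sum((G[:, None, :] - G[None, :, :]) ** 2, axis=-1)  # pairwise ||.||^2
    scores = np.array([np.sort(np.delete(d2[i], i))[: n - f - 2].sum()
                       for i in range(n)])
    selected = np.argsort(scores)[:m]   # the (m)arg min of Equation (4)
    return G[selected].mean(axis=0)
```

Note that for a fixed number of workers $n$, this rule is linear in the dimension $d$, which is the $O(d)$ property the paper emphasizes.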
Intuitively, weak Byzantine resilience requires a gradient aggregation rule (GAR) to guarantee convergence despite the presence of $f$ Byzantine workers. It can be formally stated as follows.
Definition 1 (Weak Byzantine resilience). We say that a GAR ensures weak $f$-Byzantine resilience if the sequence $x^{(k)}$ (Equation 2) converges almost surely to some $x_*$ where $\nabla Q(x_*) = 0$, despite the presence of $f$ Byzantine workers.
On the other hand, strong Byzantine resilience requires that this convergence does not lead to "bad" optima; it relates to the more intricate problem of non-convex optimization, which, in the presence of Byzantine workers, is highly aggravated by the dimension of the problem, as explained in what follows.

Figure 2: In a non-convex situation, two correct vectors (black arrows) are pointing towards the deep optimum located in area B; both vectors belong to the plane formed by lines L1 and L2. A Byzantine worker (magenta) takes advantage of the third dimension, and of the non-convex landscape, to place a vector heading towards one of the bad local optima of area A. This Byzantine vector is located in the plane (L1,L3). Due to the variance of the correct workers on the plane (L1,L2), the Byzantine one has a budget of about $\sqrt{3}$ times the disagreement of the correct workers to put as a deviation towards A, on the line (L3), while still being selected by a weakly Byzantine-resilient GAR, since its projection on the plane (L1,L2) lies exactly on the line (L1), unlike that of the correct workers. In very high dimensions, the situation is amplified by $\sqrt{d}$.

Specificity of non-convex optimization. Non-convex optimization is one of the earliest established NP-hard problems [8]. In fact, many interesting but hard questions in machine learning boil down to one answer: "because the cost function is not necessarily convex". In distributed machine learning, the non-convexity of the cost function creates two non-intuitive behaviors that are important to highlight.

(1) A "mild" Byzantine worker can make the system converge faster. For instance, it has been reported several times in the literature that noise accelerates learning [2, 8]. This can be understood from the "S" (stochasticity) of SGD: as (correct) workers cannot have a full picture of the surrounding landscape of the loss, they can only draw a sample at random and estimate the best direction based on that sample, which can be, and probably is, biased compared to the true gradient. Moreover, due to non-convexity, even the true gradient might be leading to the local minimum where the parameter server currently sits. By providing a wrong direction (i.e., neither the true gradient nor a correct stochastic estimate of it), a Byzantine worker whose resources cannot face the high-dimensional landscape of the loss might end up providing a direction that escapes that local minimum.

(2) Combined with high-dimensional issues, non-convexity explains the need for strong Byzantine resilience. Unlike the "mild" Byzantine worker, a strong adversary with more resources than the workers and the server can see a larger picture and mount an attack that calls for a stronger requirement: namely, one that cuts the $\sqrt{d}$ leeway offered to an attacker in each dimension. Figure 2 provides an illustration. This motivates the following formalization of strong Byzantine resilience.

Definition 2 (Strong Byzantine resilience). We say that a GAR ensures strong $f$-Byzantine resilience if for every $i \in [1, d]$, there exists a correct gradient $G$ (i.e., computed by a non-Byzantine worker) s.t. $\mathbb{E}\,|\mathrm{GAR}_i - G_i| = O(\frac{1}{\sqrt{d}})$. The expectation is taken over the random samples ($\xi$ in Equation 1), and $v_i$ denotes the $i$-th coordinate of a vector $v$.
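A small numerical illustration of this $\sqrt{d}$ leeway, with dimensions and noise levels of our own choosing: an attack vector can stay within the L2 budget that a distance-based (weakly resilient) GAR tolerates, while deviating by $\sqrt{d}\,\sigma$ on a single coordinate; a coordinate-wise median, by contrast, keeps the per-coordinate deviation near $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 10_000, 1.0
g = np.ones(d)                                  # "true" gradient (hypothetical)
honest = g + sigma * rng.normal(size=(10, d))   # correct estimates of g

budget = np.sqrt(d) * sigma                     # the O(sqrt(d)) L2 leeway
attack = g.copy()
attack[0] += budget                             # spend it all on one coordinate

print(np.linalg.norm(attack - g))               # = sqrt(d)*sigma: as L2-plausible
print(np.linalg.norm(honest[0] - g))            #   as an honest gradient...
print(abs(attack[0] - g[0]))                    # ...yet sqrt(d)*sigma off on one axis
med = np.median(np.vstack([honest, attack[None, :]]), axis=0)
print(abs(med[0] - g[0]))                       # ~ sigma: the sqrt(d) leeway is cut
```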
Weak vs. strong Byzantine resilience. To attack a non-Byzantine-resilient GAR such as averaging, it only takes the computation of an estimate of the gradient, which can be done in $O(n \cdot d)$ operations per round by a Byzantine worker. This attack is reasonably cheap: within the usual cost of the workload of the other workers, $O(d)$, and of the server, $O(n \cdot d)$.

To attack weakly Byzantine-resilient GARs however, one needs to find the "most legitimate but harmful vector possible", i.e., one that will (1) be selected by a weakly Byzantine-resilient GAR, and (2) be misleading convergence (red arrow in Figure 1). To find this vector, an attacker has to first collect every correct worker's vector (before they reach the server) and solve an optimization problem (by linear regression) to approximate this harmful but legitimate vector [6]. If the desired quality of the approximation is $\epsilon$, the Byzantine worker needs at least $\Omega(n \cdot d / \epsilon)$ operations to reach it with regression, a tight lower bound for a regression problem in $d$ dimensions with $n$ vectors [8]. In practice, given the precision required, the number of workers, and the dimension of today's neural network models, the per-step cost of the attack quickly becomes prohibitive.

To summarize, weak Byzantine resilience can be enough as a practical solution against attackers whose resources are comparable to the server's. However, strong Byzantine resilience remains the only provable solution against attackers with significant resources.

For the sake of our theoretical analysis, we also recall the definition of $(\alpha, f)$-Byzantine resilience [1] (Definition 3). This definition is a sufficient condition (as proved in [1] based on [2]) for weak Byzantine resilience. Even though $(\alpha, f)$-Byzantine resilience is a sufficient, but not a necessary, condition for (weak) Byzantine resilience, it has so far been used as the de facto standard [1, 4, 11] to guarantee (weak) Byzantine resilience for SGD. We will therefore follow this standard and require $(\alpha, f)$-Byzantine resilience from any GAR that is plugged into MULTI-BULYAN; in particular, we will require it from MULTI-KRUM. The theoretical analysis done in [6] guarantees that BULYAN inherits it.

Intuitively, Definition 3 states that the GAR produces an output vector that lives, on average (over the random samples used by SGD), in the cone of angle $\alpha$ around the true gradient. We simply call this the "correct cone".

Definition 3 ($(\alpha, f)$-Byzantine resilience (as in [1])). Let $0 \leq \alpha < \pi/2$ be any angular value and $0 \leq f \leq n$ any integer. Let $V_1, \dots, V_n$ be any independent identically distributed random vectors in $\mathbb{R}^d$, $V_i \sim G$, with $\mathbb{E}G = g$. Let $B_1, \dots, B_f$ be any random vectors in $\mathbb{R}^d$, possibly dependent on the $V_i$'s. An aggregation rule GAR is said to be $(\alpha, f)$-Byzantine resilient if, for any $1 \leq j_1 < \dots < j_f \leq n$, the vector

$$\mathrm{GAR} = \mathrm{GAR}(V_1, \dots, \underbrace{B_1}_{j_1}, \dots, \underbrace{B_f}_{j_f}, \dots, V_n)$$

satisfies (i) $\langle \mathbb{E}\,\mathrm{GAR}, g \rangle \geq (1 - \sin\alpha) \cdot \|g\|^2 > 0$ and (ii) for $r = 2, 3, 4$, $\mathbb{E}\|\mathrm{GAR}\|^r$ is bounded above by a linear combination of terms $\mathbb{E}\|G\|^{r_1} \cdots \mathbb{E}\|G\|^{r_{n-1}}$ with $r_1 + \dots + r_{n-1} = r$. (Having a scalar product lower bounded as in (i) guarantees that the GAR of MULTI-KRUM lives in the aforementioned cone; for a visualization of this requirement, see the ball and inner triangle of Figure 3.)
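Definition 3 lends itself to an empirical sanity check. The sketch below estimates $\langle \mathbb{E}\,\mathrm{GAR}, g \rangle$ for multi-Krum under one particular, and by no means worst-case, attack and compares it to $\|g\|^2$; the attack and all constants are ours:

```python
import numpy as np

def multi_krum(G, f, m):
    G = np.asarray(G, dtype=float)
    n = len(G)
    d2 = np.sum((G[:, None, :] - G[None, :, :]) ** 2, axis=-1)
    scores = np.array([np.sort(np.delete(d2[i], i))[: n - f - 2].sum()
                       for i in range(n)])
    return G[np.argsort(scores)[:m]].mean(axis=0)

rng = np.random.default_rng(0)
n, f, d, sigma = 11, 2, 50, 0.5
m = n - f - 2
g = np.ones(d)
outputs = []
for _ in range(2000):
    honest = g + sigma * rng.normal(size=(n - f, d))
    byzantine = np.tile(-10.0 * g, (f, 1))   # one crude attack, not the worst case
    outputs.append(multi_krum(np.vstack([honest, byzantine]), f, m))
E_gar = np.mean(outputs, axis=0)
print((E_gar @ g) / (g @ g))   # condition (i): stays close to 1, i.e. > 1 - sin(alpha)
```

Such a simulation can only refute, never establish, $(\alpha, f)$-Byzantine resilience, since one passing attack says nothing about other adversarial strategies; the guarantee itself is the object of Lemma 1 below.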
Choice of f. The properties of the existing Byzantine-resilient SGD algorithms all depend on one important parameter: the number $f$ of potentially Byzantine nodes. It is important to notice that $f$ denotes a contract between the designer of the fault-tolerant solution and the user of the solution (who implements a service on top of the solution and deploys it in a specific setting). As long as the number of Byzantine workers is less than $f$, the solution is safe. Fixing an optimal value for $f$ is an orthogonal problem; for instance, a user expecting only a small fraction of faulty machines can choose a correspondingly small $f$ and, as we show below, suffer only the corresponding $\tilde{m}/n$ slowdown.

The performance (convergence time) of certain existing Byzantine-resilient SGD algorithms in a non-Byzantine environment is independent of the choice of $f$. These algorithms do not exploit the full potential of the choice of $f$. Modern large-scale systems are versatile and often undergo important structural changes while providing online services (e.g., addition or maintenance of certain worker nodes). Intuitively, there should be a fine granularity between the level of pessimism (i.e., the value of $f$) and the performance of the SGD algorithm in the setting with no Byzantine failures.

3 MULTI-KRUM: Weak Byzantine Resilience and Slowdown
Let $n$ be any integer greater than $2$, $f$ any integer s.t. $2f + 2 < n$, and $m$ an integer s.t. $m \leq n - f - 2$. Let $\tilde{m} = n - f - 2$.

We first prove the $(\alpha, f)$-Byzantine resilience of MULTI-KRUM (Lemma 1), then prove its almost sure convergence (Lemma 2) based on that, which proves the weak Byzantine resilience of MULTI-KRUM (Theorem 1).

In all that follows, expectations are taken over the random samples used by correct workers to estimate the gradient, i.e., the "S" (stochasticity) that is inherent to SGD. It is worth noting that this analysis in expectation is not an average-case analysis from the point of view of Byzantine fault tolerance: the Byzantine workers are assumed to follow arbitrarily bad policies, and the analysis is a worst-case one.

The Byzantine resilience proof (Lemma 1) relies on the following observation: given $m \leq n - f - 2$, and in particular $m = n - f - 2$, $m$-Krum averages $m$ gradients that are all in the "correct cone"; a cone is a convex set, thus stable by averaging, so the resulting vector also lives in that cone. The angle of the cone will depend on a variable $\eta(n, f)$ as in [1]; the value of $\eta(n, f)$ itself depends on $m$. This is what enables us to use multi-Krum as the basis of our MULTI-KRUM, unlike [1] where a restriction is made to $m = 1$.

The proof of Lemma 2 is the same as the one in [1], which itself draws on the rather classic analysis of SGD made by L. Bottou [2]. The key concepts are (1) a global confinement of the sequence of parameter vectors and (2) a bound on the statistical moments of the random sequence of estimators built by the GAR of MULTI-KRUM. As in [1, 2], reasonable assumptions are made on the cost function $Q$; those assumptions are not restrictive and are common in practical machine learning.
Theorem 1 (Byzantine resilience and slowdown of MULTI-KRUM). Let $m$ be any integer s.t. $m \leq n - f - 2$. (i) MULTI-KRUM has weak Byzantine resilience against $f$ failures. (ii) In the absence of Byzantine workers, MULTI-KRUM has a slowdown (expressed as a ratio with averaging) of $\Omega(\frac{\tilde{m}}{n})$.

Proof. Proof of (i): we will require Lemma 1 and Lemma 2, then conclude by construction of MULTI-KRUM as a multi-Krum algorithm with $m = n - f - 2$.

Lemma 1. Let $V_1, \dots, V_n$ be any independent and identically distributed random $d$-dimensional vectors s.t. $V_i \sim G$, with $\mathbb{E}G = g$ and $\mathbb{E}\|G - g\|^2 = d\sigma^2$. Let $B_1, \dots, B_f$ be any $f$ random vectors, possibly dependent on the $V_i$'s. If $2f + 2 < n$ and $\eta(n, f)\,\sqrt{d}\,\sigma < \|g\|$, where

$$\eta(n, f) \overset{\text{def}}{=} \sqrt{2\left(n - f + \frac{f \cdot m + f^2 \cdot (m + 1)}{n - 2f - 2}\right)},$$

then the GAR function of MULTI-KRUM is $(\alpha, f)$-Byzantine resilient, where $0 \leq \alpha < \pi/2$ is defined by

$$\sin \alpha = \frac{\eta(n, f) \cdot \sqrt{d} \cdot \sigma}{\|g\|}.$$

(The slowdown question is an incentive to take the highest value of $m$ among those that satisfy Byzantine resilience, in this case $\tilde{m}$.)

Proof.
Without loss of generality, we assume that the Byzantine vectors $B_1, \dots, B_f$ occupy the last $f$ positions in the list of arguments of MULTI-KRUM, i.e., $\mathrm{MK} = \mathrm{MULTI\text{-}KRUM}(V_1, \dots, V_{n-f}, B_1, \dots, B_f)$, where we abbreviate MULTI-KRUM by MK in the displayed formulas. An index is correct if it refers to a vector among $V_1, \dots, V_{n-f}$; an index is Byzantine if it refers to a vector among $B_1, \dots, B_f$. For each index (correct or Byzantine) $i$, we denote by $\delta_c(i)$ (resp. $\delta_b(i)$) the number of correct (resp. Byzantine) indices $j$ such that $i \to j$ (the notation introduced in Section 2 when defining MULTI-KRUM), i.e., the number of workers, among the $m$ neighbors of $i$, that are correct (resp. Byzantine). We have

$$\delta_c(i) + \delta_b(i) = m, \qquad n - 2f - 2 \leq \delta_c(i) \leq m, \qquad \delta_b(i) \leq f.$$

We focus first on condition (i) of $(\alpha, f)$-Byzantine resilience. We determine an upper bound on the squared distance $\|\mathbb{E}\,\mathrm{MK} - g\|^2$. Note that, for any correct $j$, $\mathbb{E}V_j = g$. We denote by $i_*$ the index of the worst scoring among the $m$ vectors chosen by the MULTI-KRUM function, i.e., one that ranks with the $m$-th smallest score in Equation 4.

$$\begin{aligned} \|\mathbb{E}\,\mathrm{MK} - g\|^2 &\leq \left\|\mathbb{E}\left[\mathrm{MK} - \frac{1}{\delta_c(i_*)}\sum_{i_* \to \text{correct } j} V_j\right]\right\|^2 \\ &\leq \mathbb{E}\left\|\mathrm{MK} - \frac{1}{\delta_c(i_*)}\sum_{i_* \to \text{correct } j} V_j\right\|^2 \quad \text{(Jensen's inequality)} \\ &\leq \sum_{\text{correct } i} \mathbb{E}\left\|V_i - \frac{1}{\delta_c(i)}\sum_{i \to \text{correct } j} V_j\right\|^2 \mathbb{1}(i_* = i) + \sum_{\text{byz } k} \mathbb{E}\left\|B_k - \frac{1}{\delta_c(k)}\sum_{k \to \text{correct } j} V_j\right\|^2 \mathbb{1}(i_* = k), \end{aligned}$$

where $\mathbb{1}$ denotes the indicator function ($\mathbb{1}(P)$ equals $1$ if the predicate $P$ is true, and $0$ otherwise). We examine the case $i_* = i$ for some correct index $i$:

$$\left\|V_i - \frac{1}{\delta_c(i)}\sum_{i \to \text{correct } j} V_j\right\|^2 = \left\|\frac{1}{\delta_c(i)}\sum_{i \to \text{correct } j} (V_i - V_j)\right\|^2 \leq \frac{1}{\delta_c(i)}\sum_{i \to \text{correct } j} \|V_i - V_j\|^2 \quad \text{(Jensen's inequality)},$$

$$\mathbb{E}\left\|V_i - \frac{1}{\delta_c(i)}\sum_{i \to \text{correct } j} V_j\right\|^2 \leq \frac{1}{\delta_c(i)}\sum_{i \to \text{correct } j} \mathbb{E}\|V_i - V_j\|^2 \leq 2d\sigma^2.$$

We now examine the case $i_* = k$ for some Byzantine index $k$. The fact that $k$ minimizes the score implies that for all correct indices $i$

$$\sum_{k \to \text{correct } j} \|B_k - V_j\|^2 + \sum_{k \to \text{byz } l} \|B_k - B_l\|^2 \leq \sum_{i \to \text{correct } j} \|V_i - V_j\|^2 + \sum_{i \to \text{byz } l} \|V_i - B_l\|^2.$$

Then, for all correct indices $i$,

$$\left\|B_k - \frac{1}{\delta_c(k)}\sum_{k \to \text{correct } j} V_j\right\|^2 \leq \frac{1}{\delta_c(k)}\sum_{k \to \text{correct } j} \|B_k - V_j\|^2 \leq \frac{1}{\delta_c(k)}\sum_{i \to \text{correct } j} \|V_i - V_j\|^2 + \frac{1}{\delta_c(k)}\underbrace{\sum_{i \to \text{byz } l} \|V_i - B_l\|^2}_{D^2(i)}.$$

We focus on the term $D^2(i)$. Each correct process $i$ has $m$ neighbors and $f + 1$ non-neighbors. Thus there exists a correct worker $\zeta(i)$ which is farther from $i$ than any of the neighbors of $i$. In particular, for each Byzantine index $l$ such that $i \to l$, $\|V_i - B_l\|^2 \leq \|V_i - V_{\zeta(i)}\|^2$. Whence

$$\left\|B_k - \frac{1}{\delta_c(k)}\sum_{k \to \text{correct } j} V_j\right\|^2 \leq \frac{1}{\delta_c(k)}\sum_{i \to \text{correct } j} \|V_i - V_j\|^2 + \frac{\delta_b(i)}{\delta_c(k)}\,\|V_i - V_{\zeta(i)}\|^2,$$

$$\begin{aligned} \mathbb{E}\left\|B_k - \frac{1}{\delta_c(k)}\sum_{k \to \text{correct } j} V_j\right\|^2 &\leq \frac{\delta_c(i)}{\delta_c(k)} \cdot 2d\sigma^2 + \frac{\delta_b(i)}{\delta_c(k)}\sum_{\text{correct } j \neq i} \mathbb{E}\|V_i - V_j\|^2\, \mathbb{1}(\zeta(i) = j) \\ &\leq \left(\frac{\delta_c(i)}{\delta_c(k)} \cdot 2 + \frac{\delta_b(i)}{\delta_c(k)} \cdot 2(m+1)\right) d\sigma^2 \\ &\leq \left(\frac{m}{n - 2f - 2} \cdot 2 + \frac{f}{n - 2f - 2} \cdot 2(m+1)\right) d\sigma^2. \end{aligned}$$

Putting everything back together, we obtain

$$\|\mathbb{E}\,\mathrm{MK} - g\|^2 \leq (n - f)\,2d\sigma^2 + f \left(\frac{m}{n - 2f - 2} \cdot 2 + \frac{f}{n - 2f - 2} \cdot 2(m+1)\right) d\sigma^2 \leq \underbrace{2\left(n - f + \frac{f \cdot m + f^2 \cdot (m + 1)}{n - 2f - 2}\right)}_{\eta^2(n, f)} d\sigma^2.$$

By assumption, $\eta(n, f)\sqrt{d}\,\sigma < \|g\|$, i.e., $\mathbb{E}\,\mathrm{MK}$ belongs to a ball centered at $g$ with radius $\eta(n, f) \cdot \sqrt{d} \cdot \sigma$. This implies

$$\langle \mathbb{E}\,\mathrm{MK}, g \rangle \geq \left(\|g\| - \eta(n, f) \cdot \sqrt{d} \cdot \sigma\right) \cdot \|g\| = (1 - \sin\alpha) \cdot \|g\|^2.$$

To sum up, condition (i) of the $(\alpha, f)$-Byzantine resilience property holds. We now focus on condition (ii).

$$\mathbb{E}\|\mathrm{MK}\|^r = \sum_{\text{correct } i} \mathbb{E}\|V_i\|^r\, \mathbb{1}(i_* = i) + \sum_{\text{byz } k} \mathbb{E}\|B_k\|^r\, \mathbb{1}(i_* = k) \leq (n - f)\,\mathbb{E}\|G\|^r + \sum_{\text{byz } k} \mathbb{E}\|B_k\|^r\, \mathbb{1}(i_* = k).$$

Denoting by $C$ a generic constant, when $i_* = k$, we have for all correct indices $i$

$$\left\|B_k - \frac{1}{\delta_c(k)}\sum_{k \to \text{correct } j} V_j\right\| \leq \sqrt{\frac{1}{\delta_c(k)}\sum_{i \to \text{correct } j} \|V_i - V_j\|^2 + \frac{\delta_b(i)}{\delta_c(k)}\,\|V_i - V_{\zeta(i)}\|^2} \leq C \left(\sqrt{\frac{1}{\delta_c(k)}}\sqrt{\sum_{i \to \text{correct } j} \|V_i - V_j\|^2} + \sqrt{\frac{\delta_b(i)}{\delta_c(k)}}\,\|V_i - V_{\zeta(i)}\|\right) \leq C \sum_{\text{correct } j} \|V_j\| \quad \text{(triangular inequality)},$$

where the second inequality comes from the equivalence of norms in finite dimension. Now

$$\|B_k\| \leq \left\|B_k - \frac{1}{\delta_c(k)}\sum_{k \to \text{correct } j} V_j\right\| + \left\|\frac{1}{\delta_c(k)}\sum_{k \to \text{correct } j} V_j\right\| \leq C \sum_{\text{correct } j} \|V_j\|,$$

$$\|B_k\|^r \leq C \sum_{r_1 + \dots + r_{n-f} = r} \|V_1\|^{r_1} \cdots \|V_{n-f}\|^{r_{n-f}}.$$

Since the $V_i$'s are independent, we finally obtain that $\mathbb{E}\|\mathrm{MK}\|^r$ is bounded above by a linear combination of terms of the form $\mathbb{E}\|V_1\|^{r_1} \cdots \mathbb{E}\|V_{n-f}\|^{r_{n-f}} = \mathbb{E}\|G\|^{r_1} \cdots \mathbb{E}\|G\|^{r_{n-f}}$ with $r_1 + \dots + r_{n-f} = r$. This completes the proof of condition (ii). □
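For intuition on how restrictive the angle condition of Lemma 1 is, $\eta(n, f)$ can be evaluated numerically. The helper below implements the formula exactly as we reconstructed it above (the reconstruction should be checked against [1]; all numerical values are ours):

```python
import numpy as np

def eta(n, f, m):
    # eta(n, f) of Lemma 1 (as reconstructed above); it also depends on m
    assert 2 * f + 2 < n and 1 <= m <= n - f - 2
    return np.sqrt(2 * (n - f + (f * m + f ** 2 * (m + 1)) / (n - 2 * f - 2)))

n, f = 11, 2
m = n - f - 2
d, sigma = 1_000_000, 1e-4
print(eta(n, f, m))                       # the angular factor
print(eta(n, f, m) * np.sqrt(d) * sigma)  # Lemma 1 applies while ||g|| exceeds this
```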
Lemma 2. Assume that (i) the cost function $Q$ is three times differentiable with continuous derivatives, and is non-negative, $Q(x) \geq 0$; (ii) the learning rates satisfy $\sum_t \gamma_t = \infty$ and $\sum_t \gamma_t^2 < \infty$ (e.g., $\gamma_t = 1/t$); (iii) the gradient estimator satisfies $\mathbb{E}G(x, \xi) = \nabla Q(x)$ and $\forall r \in \{2, 3, 4\}$, $\mathbb{E}\|G(x, \xi)\|^r \leq A_r + B_r\|x\|^r$ for some constants $A_r, B_r$; (iv) there exists a constant $0 \leq \alpha < \pi/2$ such that for all $x$

$$\eta(n, f) \cdot \sqrt{d} \cdot \sigma(x) \leq \|\nabla Q(x)\| \cdot \sin\alpha;$$

(v) finally, beyond a certain horizon, $\|x\|^2 \geq D^2$, there exist $\epsilon > 0$ and $0 \leq \beta < \pi/2 - \alpha$ such that

$$\|\nabla Q(x)\| \geq \epsilon > 0, \qquad \frac{\langle x, \nabla Q(x) \rangle}{\|x\| \cdot \|\nabla Q(x)\|} \geq \cos\beta.$$

Then the sequence of gradients $\nabla Q(x_t)$ converges almost surely to zero.
Proof. For the sake of simplicity, we write $\mathrm{MK}_t = \mathrm{MULTI\text{-}KRUM}(V_1^t, \dots, V_n^t)$. Before proving the main claim of the lemma, we first show that the sequence $x_t$ is almost surely globally confined within the region $\|x\|^2 \leq D^2$.

(Global confinement.) Let $u_t = \phi(\|x_t\|^2)$ where

$$\phi(a) = \begin{cases} 0 & \text{if } a < D^2 \\ (a - D^2)^2 & \text{otherwise.} \end{cases}$$

Note that

$$\phi(b) - \phi(a) \leq (b - a)\,\phi'(a) + (b - a)^2. \tag{5}$$

This becomes an equality when $a, b \geq D^2$. Applying this inequality to $u_{t+1} - u_t$ yields

$$\begin{aligned} u_{t+1} - u_t &\leq \left(-2\gamma_t \langle x_t, \mathrm{MK}_t \rangle + \gamma_t^2 \|\mathrm{MK}_t\|^2\right) \phi'(\|x_t\|^2) + 4\gamma_t^2 \langle x_t, \mathrm{MK}_t \rangle^2 - 4\gamma_t^3 \langle x_t, \mathrm{MK}_t \rangle \|\mathrm{MK}_t\|^2 + \gamma_t^4 \|\mathrm{MK}_t\|^4 \\ &\leq -2\gamma_t \langle x_t, \mathrm{MK}_t \rangle \phi'(\|x_t\|^2) + \gamma_t^2 \|\mathrm{MK}_t\|^2 \phi'(\|x_t\|^2) + 4\gamma_t^2 \|x_t\|^2 \|\mathrm{MK}_t\|^2 + 4\gamma_t^3 \|x_t\| \|\mathrm{MK}_t\|^3 + \gamma_t^4 \|\mathrm{MK}_t\|^4. \end{aligned}$$

Let $\mathcal{P}_t$ denote the $\sigma$-algebra encoding all the information up to round $t$. Taking the conditional expectation with respect to $\mathcal{P}_t$ yields

$$\mathbb{E}(u_{t+1} - u_t \mid \mathcal{P}_t) \leq -2\gamma_t \langle x_t, \mathbb{E}\,\mathrm{MK}_t \rangle \phi'(\|x_t\|^2) + \gamma_t^2\, \mathbb{E}\!\left(\|\mathrm{MK}_t\|^2\right) \phi'(\|x_t\|^2) + 4\gamma_t^2 \|x_t\|^2\, \mathbb{E}\!\left(\|\mathrm{MK}_t\|^2\right) + 4\gamma_t^3 \|x_t\|\, \mathbb{E}\!\left(\|\mathrm{MK}_t\|^3\right) + \gamma_t^4\, \mathbb{E}\!\left(\|\mathrm{MK}_t\|^4\right).$$

Thanks to condition (ii) of $(\alpha, f)$-Byzantine resilience and the assumption on the first four moments of $G$, there exist positive constants $A_0, B_0$ such that

$$\mathbb{E}(u_{t+1} - u_t \mid \mathcal{P}_t) \leq -2\gamma_t \langle x_t, \mathbb{E}\,\mathrm{MK}_t \rangle \phi'(\|x_t\|^2) + \gamma_t^2 \left(A_0 + B_0 \|x_t\|^4\right).$$

Thus, there exist positive constants $A, B$ such that

$$\mathbb{E}(u_{t+1} - u_t \mid \mathcal{P}_t) \leq -2\gamma_t \langle x_t, \mathbb{E}\,\mathrm{MK}_t \rangle \phi'(\|x_t\|^2) + \gamma_t^2 (A + B \cdot u_t).$$

When $\|x_t\|^2 < D^2$, the first term of the right-hand side is null because $\phi'(\|x_t\|^2) = 0$. When $\|x_t\|^2 \geq D^2$, this first term is negative because (see Figure 3)

$$\langle x_t, \mathbb{E}\,\mathrm{MK}_t \rangle \geq \|x_t\| \cdot \|\mathbb{E}\,\mathrm{MK}_t\| \cdot \cos(\alpha + \beta) > 0.$$

Hence

$$\mathbb{E}(u_{t+1} - u_t \mid \mathcal{P}_t) \leq \gamma_t^2 (A + B \cdot u_t).$$

We define two auxiliary sequences

$$\mu_t = \prod_{i=1}^{t} \frac{1}{1 - \gamma_i^2 B} \xrightarrow[t \to \infty]{} \mu_\infty, \qquad u'_t = \mu_t u_t.$$

Note that the sequence $\mu_t$ converges because $\sum_t \gamma_t^2 < \infty$. Then

$$\mathbb{E}(u'_{t+1} - u'_t \mid \mathcal{P}_t) \leq \gamma_t^2 \mu_t A.$$

Consider the indicator of the positive variations of the left-hand side:

$$\chi_t = \begin{cases} 1 & \text{if } \mathbb{E}(u'_{t+1} - u'_t \mid \mathcal{P}_t) > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Then

$$\mathbb{E}\left(\chi_t \cdot (u'_{t+1} - u'_t)\right) \leq \mathbb{E}\left(\chi_t \cdot \mathbb{E}(u'_{t+1} - u'_t \mid \mathcal{P}_t)\right) \leq \gamma_t^2 \mu_t A.$$

The right-hand side of the previous inequality is the summand of a convergent series. By the quasi-martingale convergence theorem [9], this shows that the sequence $u'_t$ converges almost surely, which in turn shows that the sequence $u_t$ converges almost surely, $u_t \to u_\infty \geq 0$.

Let us assume that $u_\infty > 0$. When $t$ is large enough, this implies that $\|x_t\|^2$ and $\|x_{t+1}\|^2$ are greater than $D^2$. Inequality 5 becomes an equality, which implies that the following infinite sum converges almost surely:

$$\sum_{t=1}^{\infty} \gamma_t \langle x_t, \mathbb{E}\,\mathrm{MK}_t \rangle \phi'(\|x_t\|^2) < \infty.$$

Note that the sequence $\phi'(\|x_t\|^2)$ converges to a positive value. In the region $\|x_t\|^2 > D^2$, we have

$$\langle x_t, \mathbb{E}\,\mathrm{MK}_t \rangle \geq D \cdot \|\mathbb{E}\,\mathrm{MK}_t\| \cdot \cos(\alpha + \beta) \geq D \cdot \left(\|\nabla Q(x_t)\| - \eta(n, f) \cdot \sqrt{d} \cdot \sigma(x_t)\right) \cdot \cos(\alpha + \beta) \geq D \cdot \epsilon \cdot (1 - \sin\alpha) \cdot \cos(\alpha + \beta) > 0.$$

This contradicts the fact that $\sum_{t=1}^{\infty} \gamma_t = \infty$. Therefore, the sequence $u_t$ converges to zero. This convergence implies that the sequence $\|x_t\|^2$ is bounded, i.e., the vector $x_t$ is confined in a bounded region containing the origin. As a consequence, any continuous function of $x_t$ is also bounded, such as, e.g., $\|x_t\|^2$, $\mathbb{E}\|G(x_t, \xi)\|^2$ and all the derivatives of the cost function $Q(x_t)$. In the sequel, positive constants $K_1, K_2$, etc. are introduced whenever such a bound is used.

(Convergence.) We proceed to show that the gradient $\nabla Q(x_t)$ converges almost surely to zero. We define $h_t = Q(x_t)$. Using a first-order Taylor expansion and bounding the second derivative with $K_1$, we obtain

$$\left|h_{t+1} - h_t + \gamma_t \langle \mathrm{MK}_t, \nabla Q(x_t) \rangle\right| \leq \gamma_t^2 \|\mathrm{MK}_t\|^2 K_1 \quad \text{a.s.}$$

Therefore

$$\mathbb{E}(h_{t+1} - h_t \mid \mathcal{P}_t) \leq -\gamma_t \langle \mathbb{E}\,\mathrm{MK}_t, \nabla Q(x_t) \rangle + \gamma_t^2\, \mathbb{E}\!\left(\|\mathrm{MK}_t\|^2 \mid \mathcal{P}_t\right) K_1. \tag{6}$$

By the properties of $(\alpha, f)$-Byzantine resilience, the first term of the right-hand side is non-positive, which implies

$$\mathbb{E}(h_{t+1} - h_t \mid \mathcal{P}_t) \leq \gamma_t^2 K_1 K_2,$$

which in turn implies that the positive variations of $h_t$ are also bounded:

$$\mathbb{E}(\chi_t \cdot (h_{t+1} - h_t)) \leq \gamma_t^2 K_1 K_2.$$

The right-hand side is the summand of a convergent infinite sum. By the quasi-martingale convergence theorem, the sequence $h_t$ converges almost surely, $Q(x_t) \to Q_\infty$.

Taking the expectation of Inequality 6 and summing on $t = 1, \dots, \infty$, the convergence of $Q(x_t)$ implies that

$$\sum_{t=1}^{\infty} \gamma_t \langle \mathbb{E}\,\mathrm{MK}_t, \nabla Q(x_t) \rangle < \infty \quad \text{a.s.}$$

Figure 3: Condition on the angles between $x_t$, $\nabla Q(x_t)$ and the MULTI-KRUM vector $\mathbb{E}\,\mathrm{MK}_t$, in the region $\|x_t\|^2 > D^2$.

We now define $\rho_t = \|\nabla Q(x_t)\|^2$. Using a Taylor expansion, as demonstrated for the variations of $h_t$, we obtain

$$\rho_{t+1} - \rho_t \leq -2\gamma_t \left\langle \mathrm{MK}_t, \left(\nabla^2 Q(x_t)\right) \cdot \nabla Q(x_t) \right\rangle + \gamma_t^2 \|\mathrm{MK}_t\|^2 K_3 \quad \text{a.s.}$$

Taking the conditional expectation, and bounding the second derivatives by $K_4$,

$$\mathbb{E}(\rho_{t+1} - \rho_t \mid \mathcal{P}_t) \leq 2\gamma_t \langle \mathbb{E}\,\mathrm{MK}_t, \nabla Q(x_t) \rangle K_4 + \gamma_t^2 K_2 K_3.$$

The positive expected variations of $\rho_t$ are bounded:

$$\mathbb{E}(\chi_t \cdot (\rho_{t+1} - \rho_t)) \leq 2\gamma_t\, \mathbb{E}\langle \mathbb{E}\,\mathrm{MK}_t, \nabla Q(x_t) \rangle K_4 + \gamma_t^2 K_2 K_3.$$

The two terms on the right-hand side are the summands of convergent infinite series. By the quasi-martingale convergence theorem, this shows that $\rho_t$ converges almost surely.

We have

$$\langle \mathbb{E}\,\mathrm{MK}_t, \nabla Q(x_t) \rangle \geq \left(\|\nabla Q(x_t)\| - \eta(n, f) \cdot \sqrt{d} \cdot \sigma(x_t)\right) \cdot \|\nabla Q(x_t)\| \geq \underbrace{(1 - \sin\alpha)}_{> 0} \cdot \rho_t.$$

This implies that the following infinite series converges almost surely:

$$\sum_{t=1}^{\infty} \gamma_t \cdot \rho_t < \infty.$$

Since $\rho_t$ converges almost surely and the series $\sum_{t=1}^{\infty} \gamma_t = \infty$ diverges, we conclude that the sequence $\|\nabla Q(x_t)\|$ converges almost surely to zero. □

We conclude the proof of (i) by recalling the definition of MULTI-KRUM as the instance of $m$-Krum with $m = n - f - 2$.

Proof of (ii). (ii) is a consequence of the fact that $m$-Krum is the average of $m$ estimators of the gradient. In the absence of Byzantine workers, all those estimators are not only in the "correct cone" but come from correct workers (Byzantine workers can also be in the correct cone, but here there are none). As the convergence speed of SGD grows linearly with the number $m$ of averaged estimators of the gradient, the slowdown result follows. □
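The variance argument behind the $\Omega(\tilde{m}/n)$ slowdown can also be observed numerically: with a constant learning rate on the toy objective $Q(x) = \frac{1}{2}\|x\|^2$, the residual error floor of SGD is inversely proportional to the number of averaged gradient estimators, so averaging $\tilde{m}$ instead of $n$ workers multiplies it by roughly $n/\tilde{m}$. A sketch (objective, constants, and helper names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n, f, d, sigma, steps, gamma = 11, 2, 20, 1.0, 4000, 0.05
m = n - f - 2   # m_tilde

def error_floor(k):
    # SGD on Q(x) = 0.5 * ||x||^2 (grad Q = x), averaging k of the n estimates
    x = 10.0 * np.ones(d)
    errs = []
    for _ in range(steps):
        grads = x + sigma * rng.normal(size=(n, d))  # n unbiased estimates of x
        x = x - gamma * grads[:k].mean(axis=0)
        errs.append(np.sum(x ** 2))
    return np.mean(errs[steps // 2:])                # stationary noise floor

print(error_floor(n) / error_floor(m))  # ~ m/n: fewer averaged estimates, more noise
```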
4 MULTI-BULYAN: Strong Byzantine Resilience and Slowdown

Let $n$ be any integer greater than $2$, $f$ any integer s.t. $4f + 3 \leq n$, and $m$ an integer s.t. $m \leq n - f - 2$. Let $\tilde{m} = n - f - 2$.
Theorem 2 (Byzantine resilience and slowdown of MULTI-BULYAN). (i) MULTI-BULYAN provides strong Byzantine resilience against $f$ failures. (ii) In the absence of Byzantine workers, MULTI-BULYAN has a slowdown (expressed as a ratio with averaging) of $\Omega(\frac{\tilde{m}}{n})$.

Proof. If the number of iterations over MULTI-KRUM is $n - 2f$, then the leeway, defined as the coordinate-wise distance between the output of BULYAN and a correct gradient, is upper bounded by $O(\frac{1}{\sqrt{d}})$. This is due to the fact that BULYAN relies on a component-wise median that, as proven in [6], guarantees this bound. The proof is then a direct consequence of Theorem 1 and the properties of Bulyan [6]. □
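Putting the two components together, here is a minimal sketch of MULTI-BULYAN as described in Section 2: multi-Krum selects $m$ vectors, then BULYAN computes the coordinate-wise median and averages, per coordinate, the $m - 2f$ values closest to it. This follows our reading of [6] and of the description above; function names are ours:

```python
import numpy as np

def multi_krum_select(gradients, f, m):
    # Equation (4): the m vectors with the smallest Krum scores
    G = np.asarray(gradients, dtype=float)
    n = len(G)
    d2 = np.sum((G[:, None, :] - G[None, :, :]) ** 2, axis=-1)
    scores = np.array([np.sort(np.delete(d2[i], i))[: n - f - 2].sum()
                       for i in range(n)])
    return G[np.argsort(scores)[:m]]

def bulyan(selected, f):
    # Coordinate-wise median, then average of the m - 2f values closest to it
    S = np.asarray(selected, dtype=float)
    m = len(S)
    assert m - 2 * f >= 1
    med = np.median(S, axis=0)
    order = np.argsort(np.abs(S - med), axis=0)[: m - 2 * f]  # per coordinate
    return np.take_along_axis(S, order, axis=0).mean(axis=0)

def multi_bulyan(gradients, f, m):
    return bulyan(multi_krum_select(gradients, f, m), f)
```

Both stages are linear in $d$ for a fixed number of workers $n$, which preserves the $O(d)$ complexity claimed in the abstract.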
References

[1] Blanchard, P., El Mhamdi, E. M., Guerraoui, R., and Stainer, J. Machine learning with adversaries: Byzantine tolerant gradient descent. In NIPS (2017), pp. 118-128.

[2] Bottou, L. Online learning and stochastic approximations. Online Learning in Neural Networks 17, 9 (1998), 142.

[3] Damaskinos, G., El-Mhamdi, E.-M., Guerraoui, R., Guirguis, A., and Rouault, S. AggregaThor: Byzantine machine learning via robust gradient aggregation. In Proceedings of the 1st SysML Conference (2019).

[4] Damaskinos, G., El Mhamdi, E. M., Guerraoui, R., Patra, R., Taziki, M., et al. Asynchronous Byzantine machine learning (the case of SGD). In Proceedings of the 35th International Conference on Machine Learning (Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018), vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 1153-1162.

[5] Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep networks. In NIPS (2012), pp. 1223-1231.

[6] El Mhamdi, E. M., Guerraoui, R., and Rouault, S. The hidden vulnerability of distributed learning in Byzantium. In Proceedings of the 35th International Conference on Machine Learning (Stockholmsmässan, Stockholm, Sweden, 10-15 Jul 2018), vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 3521-3530.

[7] Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017).

[8] Haykin, S. S. Neural Networks and Learning Machines, vol. 3. Pearson, Upper Saddle River, NJ, USA, 2009.

[9] Métivier, M. Semi-Martingales. Walter de Gruyter, 1983.

[10] Shalev-Shwartz, S., and Ben-David, S. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[11] Xie, C., Koyejo, O., and Gupta, I. Generalized Byzantine-tolerant SGD. arXiv preprint arXiv:1802.10116 (2018).